This repository contains ready-to-use, pretrained Romansh language identification (LID) models together with the implementation of the underlying LID system presented in the paper "Robust Language Identification for Romansh Varieties".
- Quickstart Guide to Using Pretrained Romansh LID Models
- About Romansh
- About the Raw Data
- About the Scripts in this Repo
Inside the LID_models folder, there are three different Romansh LID models resulting from the different experiments run on the Romansh data:
- svm_char_word_train_dev: a linear SVM model trained on the combined training set and development set, where named entities were masked.
- svm_char_word: a linear SVM model trained on the training set only, where named entities were masked.
- svm_char_word_train_unmasked: a linear SVM model trained on the training set only, where named entities were not masked.
Even though more models were produced throughout the experiments, these models were the ones discussed in detail in the accompanying article and allow for the reproduction of the evaluations made. In terms of performance, all three models achieved similar accuracy and F1 scores across the different test sets; please see the opt_smv.py script to reproduce these results.
To use one of the Romansh LID models in the LID_models folder in Python, you will need the joblib and scikit-learn packages. The following code snippets demonstrate how to import the required function, load a Romansh LID model, and interact with it.
To clone this repository and navigate into it, run the following in the terminal:
git clone https://github.com/ZurichNLP/romansh-lid.git && cd romansh-lid
To install joblib and the correct version of scikit-learn, run the following in the terminal:
pip install --upgrade pip
pip install scikit-learn==1.7.1
pip install joblib
Alternatively, you can create a Python virtual environment and install the dependencies listed in the requirements.txt file.
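Models saved with joblib are tied to the library version that produced them, and scikit-learn warns when an estimator is loaded under a different version. The following is a minimal sketch, not part of this repository, for checking the installed version before loading a model. The helper name `sklearn_version_matches` is hypothetical, and the expected version `1.7.1` is taken from the install command above.

```python
from importlib import metadata

EXPECTED_SKLEARN = "1.7.1"  # version pinned in the install command above


def sklearn_version_matches(expected=EXPECTED_SKLEARN, package="scikit-learn"):
    """Return (matches, installed); installed is None if the package is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, None
    return installed == expected, installed


if __name__ == "__main__":
    ok, installed = sklearn_version_matches()
    if installed is None:
        print("scikit-learn is not installed")
    elif ok:
        print(f"scikit-learn {installed} matches the pinned version")
    else:
        print(f"scikit-learn {installed} found, expected {EXPECTED_SKLEARN}")
```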
To import the load function needed to unpack a saved LID model, include the following line at the top of your Python file:
from joblib import load
To load a desired Romansh LID model from the LID_models folder (in this example we are using the svm_char_word model) and store it in a variable, use the following:
rm_lid_model = load("LID_models/svm_char_word.joblib")
The variable rm_lid_model now contains the Romansh LID model trained on the masked training data and is an instance of scikit-learn's Pipeline class. As such, it supports the methods defined in the Pipeline class.
For example, we can use the model's predict method to predict the variety of any given string in Romansh. The following example tests the sentence "I dream of holidays by the sea." in the different Romansh varieties:
# "I dream of holidays by the sea."
print(rm_lid_model.predict(["Ia ma semtg da vacanzas sper la mar."])[0]) # rm-surmiran
print(rm_lid_model.predict(["Jau siemiel da vacanzas a la mar."])[0]) # rm-rumgr
print(rm_lid_model.predict(["Jeu siemiel da vacanzas sper la mar."])[0]) # rm-sursilv
print(rm_lid_model.predict(["Jou sasiemgn da vacànzas a la mar."])[0]) # rm-sutsilv
print(rm_lid_model.predict(["Eau insömg da vacanzas al mer."])[0]) # rm-puter
print(rm_lid_model.predict(["Eu insömg da vacanzas al mar."])[0]) # rm-vallader
Running this script outputs the following to the terminal:
rm-surmiran
rm-rumgr
rm-sursilv
rm-sutsilv
rm-puter
rm-vallader
See scikit-learn's documentation of the Pipeline class to learn more about the available methods and their parameters.
test_usage.py contains a ready-to-run script consisting of the lines of code described above.
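Since `predict` accepts a list of strings, several sentences can be labelled in a single call. The sketch below is not part of test_usage.py. The helper names `predict_varieties` and `variety_counts` are hypothetical, and `StubModel` is a stand-in with a hard-coded rule so that the snippet runs without the model file. With the real model, replace it with the object returned by `load`.

```python
from collections import Counter


def predict_varieties(model, sentences):
    """Label many sentences with one predict() call; returns (sentence, label) pairs."""
    sentences = list(sentences)
    labels = model.predict(sentences)
    return [(sentence, str(label)) for sentence, label in zip(sentences, labels)]


def variety_counts(model, sentences):
    """Count how often each predicted variety occurs in a collection of sentences."""
    return Counter(label for _, label in predict_varieties(model, sentences))


class StubModel:
    """Hypothetical stand-in exposing predict(); NOT the real LID model."""

    def predict(self, texts):
        # Hard-coded toy rule, only here to make the sketch runnable.
        return ["rm-sursilv" if t.startswith("Jeu") else "rm-vallader" for t in texts]


if __name__ == "__main__":
    # With the real model: model = load("LID_models/svm_char_word.joblib")
    model = StubModel()
    sentences = [
        "Jeu siemiel da vacanzas sper la mar.",
        "Eu insömg da vacanzas al mar.",
    ]
    for sentence, label in predict_varieties(model, sentences):
        print(f"{label}\t{sentence}")
    print(variety_counts(model, sentences))
```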
The term “Romansh” refers to a collection of closely related linguistic varieties of Rhaetian descent that are native to the canton of Grisons in Switzerland and spoken by approximately 40,000 people. These varieties, known as “idioms”, comprise five historically distinct forms (Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader) as well as Rumantsch Grischun, a standardized written form introduced in 1982 to facilitate official communication.
More information about Romansh can be found here and here.
Romansh data of varying public availability from five different sources was used in the training and evaluation of the Romansh LID models. An overview of the data sources and their availability is given below:
Pledari Grond is a dictionary of Romansh that aims to compile the vocabulary across all Romansh varieties (incl. Rumantsch Grischun), along with German translations, into a single interface. The dictionary and its data are openly available online here.
La Quotidiana is a daily, subscription-based Romansh newspaper that publishes articles in different varieties (incl. Rumantsch Grischun). We exported content from variety-annotated WordPress dumps published from 2021 to 2025, which have been made accessible here for research purposes.
RTR is the Romansh broadcasting outlet of the Swiss broadcasting confederation Schweizerische Radio- und Fernsehgesellschaft (SRG). As a private association creating publicly financed media content, the SRG has been legally mandated to provide access to its archives since 2016 and does so through API endpoints for its various outlets and media branches. We used the RTR Linguistic endpoint described here, which provides validated speech transcripts per variety (incl. Rumantsch Grischun).
The Telesguard Notes derive from RTR's television broadcast “Telesguard”, which delivers news surrounding the Rumantsch community and the greater Switzerland area from Monday to Friday. The data contains journalists’ pre-broadcast notes in their native idioms (no Rumantsch Grischun). As of this writing (01.2026), this data has not been made publicly available yet.
The Mediomatix textbooks contain parallel scholastic materials per idiom (excl. Rumantsch Grischun), used by the schools in Romansh communities to teach their respective idiom. This data has been made available for research purposes here.
This script defines and runs the pipeline that parses the raw data from all five data sources and applies various source-specific preprocessing steps in order to split the data into training, validation and test sets.
This folder contains the scripts for the parsing of the raw data from each data source.
This file contains the code that runs the inter-class exact duplicates analysis and the inter-class near duplicates analysis.
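To illustrate what these two analyses look for, the following is a minimal sketch under stated assumptions and not the repository's implementation. An inter-class exact duplicate is taken here to be a text that occurs under more than one variety label, and a near duplicate is approximated by Jaccard similarity over character trigrams. The function names, the whitespace normalisation, the trigram size and the threshold of 0.8 are all assumptions.

```python
from collections import defaultdict


def inter_class_exact_duplicates(samples):
    """Map each text that occurs under more than one label to its sorted labels.

    `samples` is an iterable of (text, label) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, label in samples:
        # Whitespace normalisation is an assumption of this sketch.
        labels_by_text[" ".join(text.split())].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams; short texts yield one shorter gram."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inter_class_near_duplicates(samples, threshold=0.8, n=3):
    """Return (text_a, label_a, text_b, label_b, similarity) for cross-label pairs.

    Exact duplicates also appear, since their similarity is 1.0. The pairwise
    comparison is quadratic, so this only suits small collections.
    """
    items = [(text, label, char_ngrams(text, n)) for text, label in samples]
    pairs = []
    for i, (text_a, label_a, grams_a) in enumerate(items):
        for text_b, label_b, grams_b in items[i + 1:]:
            if label_a == label_b:
                continue  # only pairs across different labels are of interest
            similarity = jaccard(grams_a, grams_b)
            if similarity >= threshold:
                pairs.append((text_a, label_a, text_b, label_b, similarity))
    return pairs


if __name__ == "__main__":
    samples = [
        ("Jau siemiel da vacanzas a la mar.", "rm-rumgr"),
        ("Jau siemiel  da vacanzas a la mar.", "rm-sursilv"),
        ("Eu insömg da vacanzas al mar.", "rm-vallader"),
    ]
    print(inter_class_exact_duplicates(samples))
    print(inter_class_near_duplicates(samples))
```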
This file contains the code that defines the training, development and test set splitting. This includes running named entity recognition.
This file contains the code for the named entity recognition.
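Several of the models are described as trained on data where named entities were masked. As a rough illustration of that idea only, the sketch below replaces character spans with a placeholder once the entities have been located. The function name `mask_entities`, the `(start, end)` span format and the mask token `<ENT>` are hypothetical, and the sketch does not perform named entity recognition itself.

```python
def mask_entities(text, spans, mask="<ENT>"):
    """Replace each (start, end) character span in `text` with `mask`.

    Spans use Python slice semantics (end is exclusive) and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("entity spans must not overlap")
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(mask)                # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last entity
    return "".join(pieces)


if __name__ == "__main__":
    # Spans are given by hand here; in practice they would come from an NER step.
    print(mask_entities("Anna lives in Chur.", [(0, 4), (14, 18)]))
```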
This file applies cleaning measures relevant to the entire training data.
This file contains preliminary experiments that were run to compare the initial performance of different methods.
This file contains experiments involving linear SVM models and logistic regression models.
This file contains the code that was used to train and evaluate the final Romansh LID models.
This work is based on a Bachelor's thesis presented to the University of Zurich. We thank Lia Rumantscha and RTR for their support and for facilitating access to the data sources used in this study, including Pledari Grond, La Quotidiana, and RTR materials. We also thank Uniun dals Grischs for making dictionary data for Puter and Vallader available to us for research use. Their commitment to preserving and promoting the Romansh language made this research possible.
@misc{model-et-al-2026-robust,
title={Robust Language Identification for Romansh Varieties},
author={Charlotte Model and Sina Ahmadi and Jannis Vamvas},
year={2026},
eprint={2603.15969},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.15969},
}