SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paper: arXiv:2308.11466
Original SONAR weights converted to the Hugging Face format.
This model is part of the Open SONAR project, an open training pipeline for SONAR. The pipeline can be used to fine-tune an existing SONAR model or to train one from scratch.
Examples are available here.
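Load the model and tokenizer from this repository: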
# SONARForText2Text and NllbTokenizer come from the Open SONAR codebase (see links below)
sonar = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
tokenizer = NllbTokenizer.from_pretrained("tutur90/SONAR-Text-to-Text")
The code for SONARForText2Text is available in Open SONAR - Model, and the code for NllbTokenizer in Open SONAR - Tokenizer.
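The snippet below translates a single sentence end-to-end. The example sentence and the NLLB-style language codes are illustrative placeholders: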
# Placeholder inputs; the language codes follow the NLLB convention (assumed here)
sentence = "Hello, world!"
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")
generated = sonar.generate(
    **inputs,
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,      # greedy decoding
    do_sample=False,  # deterministic output, no sampling
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(f"SONAR {src_lang} -> {tgt_lang}: {decoded}")
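The encoder and decoder can also be used separately. The encoder maps each input sentence to a single fixed-size embedding: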
encoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
encoder.set_encoder_only()  # Drop the decoder to save memory (optional)

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")
embeddings = encoder.encode(**inputs)  # one fixed-size embedding per sentence
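Because these embeddings are language-agnostic sentence representations, a natural use is comparing sentences across languages. Below is a minimal sketch, assuming encoder.encode returns a (batch, dim) tensor; the second sentence and the cosine comparison are illustrative, not part of this card:

import torch.nn.functional as F

# Embed a second sentence in another language ("fra_Latn" follows the NLLB convention)
inputs_fr = tokenizer("Bonjour le monde !", langs="fra_Latn", return_tensors="pt")
embeddings_fr = encoder.encode(**inputs_fr)

# Cosine similarity between the two sentence embeddings;
# semantically equivalent sentences should score close to 1.
similarity = F.cosine_similarity(embeddings, embeddings_fr, dim=-1)
print(f"cross-lingual similarity: {similarity.item():.3f}")

To turn embeddings back into text, the decoder side can be loaded on its own: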
decoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
decoder.set_decoder_only()  # Drop the encoder to save memory (optional)

generated = decoder.decode(
    embeddings,  # sentence embeddings produced by the encoder above
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,
    do_sample=False,
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(f"SONAR (split) {src_lang} -> {tgt_lang}: {decoded}")
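With greedy decoding, encoding and decoding in two steps should reproduce the translation from the combined generate call above. Splitting the model is useful when embeddings are computed once and reused, for example to decode the same embedding into several target languages, or to run only the encoder for embedding-based tasks such as cross-lingual retrieval.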