Universal Language Contribution API
About ULCA

ULCA is a standard API and an open, scalable data platform for Indian language datasets and models, supporting multiple dataset types.

Application Programming Interfaces

Datasets: Language datasets
  • Parallel text corpus in two or more languages
  • Monolingual text corpus
  • Automatic Speech Recognition (ASR) corpus
  • Text to Speech (TTS) corpus
  • Optical Character Recognition (OCR) corpus
  • Natural Language Understanding (NLU) datasets
Models: Language-specific tasks
  • Machine Translation (MT)
  • Automatic Speech Recognition (ASR)
  • Text to Speech (TTS)
  • Optical Character Recognition (OCR)
Benchmarks: Open benchmarking
  • Large, diverse and task-specific benchmarks
  • Metric system approved by the research community

Why ULCA?

  • Be the premier data repository for Indian language resources
  • Collect datasets for MT, ASR, TTS, OCR and various NLP tasks in standardized but extensible formats
  • Collect extensive metadata for each dataset to support various analyses
  • Proper attribution for every contributor at the record level
  • Deduplication capability built-in
  • Simple interface to search and download datasets based on various filters
  • Perform various quality checks on the submitted datasets
  • Trained models for language-specific tasks
  • Multiple benchmarks defined for each model task
  • Human-validated benchmark datasets
  • Create and submit new benchmark metrics for any model task
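
As an illustration of how record-level attribution and deduplication can work together, here is a minimal Python sketch for a parallel text corpus. It is not ULCA's actual implementation: the ParallelRecord type, the record_key and merge_submissions helpers, and the choice of a SHA-256 hash over whitespace-normalized text are all assumptions made for this example.

```python
import hashlib
import unicodedata
from dataclasses import dataclass, field


@dataclass
class ParallelRecord:
    """Hypothetical record: one sentence pair plus everyone who submitted it."""
    source_text: str
    target_text: str
    contributors: set[str] = field(default_factory=set)


def record_key(source_text: str, target_text: str) -> str:
    """Hash the normalized sentence pair so duplicates map to the same key."""
    def norm(text: str) -> str:
        # Unicode NFC normalization, then collapse runs of whitespace.
        return " ".join(unicodedata.normalize("NFC", text).split())

    payload = norm(source_text) + "\x1f" + norm(target_text)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge_submissions(submissions):
    """Merge (contributor, source, target) tuples into deduplicated records.

    A duplicate pair is stored once, and every contributor who submitted it
    stays attributed on that single record.
    """
    records: dict[str, ParallelRecord] = {}
    for contributor, source_text, target_text in submissions:
        key = record_key(source_text, target_text)
        record = records.setdefault(key, ParallelRecord(source_text, target_text))
        record.contributors.add(contributor)
    return list(records.values())


# Hypothetical submissions; the second differs from the first only in spacing.
submissions = [
    ("alice", "Hello world", "नमस्ते दुनिया"),
    ("bob", "Hello  world ", "नमस्ते दुनिया"),
    ("bob", "Good morning", "सुप्रभात"),
]
merged = merge_submissions(submissions)
```

In this sketch the two spacing variants of the first pair collapse into one record credited to both contributors, while the third pair remains a separate record. A production system would likely need stronger normalization (for example, punctuation or script-specific handling) than the whitespace rule assumed here.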