Gujarati Legal Document Corpus and NLP Pipeline for Court Case Understanding

Ongoing

Building a machine learning-ready Gujarati legal corpus from Gujarat eCourts data, with NLP pipelines for classification, named entity recognition, summarization, and retrieval-augmented QA over district court judgments.

NLP, OCR, Transformers, IndicBERT, MuRIL, Python, Low-resource NLP, RAG

Overview

This project constructs a low-resource Gujarati legal NLP pipeline from scraped Gujarat eCourts PDFs. The core challenge is that court documents use inconsistent encodings (valid Unicode, legacy fonts such as LMG-Arun and TERAFONT-VARUN, and corrupted ToUnicode mappings), so the pipeline evaluates multiple text extraction strategies before any model training.

Once text is extracted and normalized, each case is converted into a structured JSONL record with metadata (district, court, year, CNR number, section) and task-specific labels. Supervised fine-tuning targets include legal text classification, named entity recognition, and legal summarization.

The retrieval layer uses multilingual sentence embeddings (LaBSE, BGE-M3, multilingual-e5) to enable semantic search and RAG over the corpus. The ML contribution is studying how OCR noise propagates into downstream NLP model performance on real district court data.

Document Extraction and Normalization

Gujarat court PDFs are highly inconsistent: some contain valid Unicode Gujarati text, some use legacy fonts (LMG-Arun, TERAFONT-VARUN), and others have corrupted or missing ToUnicode mappings. The pipeline therefore evaluates several extraction strategies before model training: direct PDF parsing, legacy font conversion, Tesseract Gujarati OCR, SuryaOCR, Google Cloud Vision, IronOCR, and deterministic glyph-to-Unicode mapping. Extracted text is cleaned and stored with metadata: district, court, year, legal section, case number, CNR number, extraction method, and error flags. Known OCR errors include the Gujarati digit ૨ being confused with the letter ર, broken conjuncts, wrong vowel marks, and spurious text hallucinated from stamps and signatures.
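As an illustration of the cleaning step, the following is a minimal sketch of one possible normalization pass, not the project's actual implementation. It applies Unicode NFC normalization, collapses repeated spaces, and repairs the ૨/ર confusion with a conservative heuristic: the letter ર is rewritten to the digit ૨ only inside a run that contains at least one Gujarati digit and is not attached to any other Gujarati character. The function name `normalize_gujarati` and the heuristic itself are assumptions made for this example.

```python
import re
import unicodedata

RA = "\u0AB0"        # Gujarati letter ra (ર)
GUJ_TWO = "\u0AE8"   # Gujarati digit two (૨)

_GUJ = r"\u0A80-\u0AFF"      # Gujarati Unicode block
_DIGITS = r"\u0AE6-\u0AEF"   # Gujarati digits ૦ to ૯

# A run of Gujarati digits and ra that contains at least one digit and is not
# glued to any other Gujarati character on either side (hypothetical heuristic).
NUMERIC_RUN = re.compile(
    rf"(?<![{_GUJ}])[{_DIGITS}{RA}]*[{_DIGITS}][{_DIGITS}{RA}]*(?![{_GUJ}])"
)


def normalize_gujarati(text: str) -> str:
    """Apply NFC, fix ra-for-two confusions inside numbers, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = NUMERIC_RUN.sub(lambda m: m.group(0).replace(RA, GUJ_TWO), text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_gujarati("તા.  ૧ર/૦૩/૨૦ર૩"))
```

Because the rule only fires inside numeric runs, ordinary words that contain ર (for example રાજકોટ) are left untouched. A standalone ર that legitimately sits next to digits would still be rewritten, so a rule like this would need checking against real extraction output.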

Dataset Construction

Each scraped case is converted into a structured JSONL record containing raw judgment text, case metadata, section information, and document-level labels. For bail-related cases, labels include bail granted/rejected/pending, legal section, offense type, applicant/respondent details, court name, date, advocate names, judge names, and cited statutes. This creates a task-specific Gujarati legal dataset usable for both classification and extraction.
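A record of this kind could be represented as follows. This is a sketch under assumptions: the field names (`cnr_number`, `extraction_method`, `labels`, and so on) and the placeholder values are hypothetical and chosen only to mirror the metadata listed above; the project's real schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CaseRecord:
    # Hypothetical schema mirroring the metadata described in the text.
    cnr_number: str
    district: str
    court: str
    year: int
    section: str
    extraction_method: str
    text: str
    labels: dict = field(default_factory=dict)
    error_flags: list = field(default_factory=list)


def to_jsonl_line(record: CaseRecord) -> str:
    """Serialize one record as a single JSON line, keeping Gujarati readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    example = CaseRecord(
        cnr_number="PLACEHOLDER-CNR",   # placeholder, not a real CNR number
        district="example-district",
        court="example-court",
        year=2023,
        section="CrPC 439",
        extraction_method="direct_pdf",
        text="...",
        labels={"bail_outcome": "granted"},
    )
    print(to_jsonl_line(example))
```

Passing `ensure_ascii=False` keeps Gujarati text as readable characters in the file instead of `\uXXXX` escapes, which makes manual inspection of the corpus easier.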

Baseline OCR and Text-Quality Evaluation

Before training legal NLP models, text extraction quality is compared across tools. Tesseract serves as the baseline but produces major errors in names, dates, amounts, conjuncts, and legal phrases. SuryaOCR is a stronger open-source alternative, while Google Cloud Vision and IronOCR offer higher accuracy at a cost. Legacy-font conversion is critical because many PDFs render Gujarati correctly on screen but store font-specific glyph IDs internally rather than Unicode code points. Because OCR quality directly gates downstream NLP performance, this comparison comes before any model training.
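One common way to quantify extraction quality is the character error rate (CER): the Levenshtein edit distance between a reference transcription and the tool's output, divided by the reference length. The source does not say which metric the project uses, so the sketch below is an assumption. It works at the Unicode code-point level, so a wrong vowel mark counts as one error.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four: ૨ misread as ર.
    print(cer("૨૦૨૩", "૨૦ર૩"))
```

Computing this per tool on a small hand-corrected sample would give a comparable score for each of the extraction methods named above.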

Gujarati Legal Text Classification

Transformer models (multilingual BERT, XLM-RoBERTa, IndicBERT, MuRIL) are fine-tuned on the cleaned corpus for document-level classification tasks:

Order-type classification: bail order, domestic violence order, or criminal procedure order
Legal section classification: CrPC 436, 437, 438, 439
Case outcome prediction: bail granted or rejected
Content-type detection: procedural, factual, or final judgment content
Document quality classification: clean, noisy, or unusable extraction

Legal Named Entity Recognition

Sequence labeling models extract structured legal entities from Gujarati court text. Target entities include petitioner/applicant names, respondent names, advocate and judge names, court names, dates, monetary amounts, sections, acts, police station names, FIR numbers, and final order outcomes. Models include IndicBERT, MuRIL, and XLM-R fine-tuned for token classification, with a rule-based regex baseline for structured fields like dates and case numbers.
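The rule-based baseline for structured fields could look like the sketch below. The patterns are assumptions for illustration: dates are taken to be written day/month/year with `/`, `.`, or `-` as separators, and section references are assumed to follow either "CrPC" or the Gujarati word કલમ. Python's `\d` matches Gujarati digits as well as ASCII digits in `str` patterns, and `int()` converts both.

```python
import re
from datetime import date

# Assumed formats; real court orders may use other layouts.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})(?!\d)")
SECTION_RE = re.compile(r"(?:CrPC|કલમ)\s*(\d{2,3})(?!\d)")


def extract_dates(text: str) -> list:
    """Return valid calendar dates found in text (day/month/year order)."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 45/13/2023
    return found


def extract_sections(text: str) -> list:
    """Return section numbers that follow 'CrPC' or 'કલમ'."""
    return [int(num) for num in SECTION_RE.findall(text)]


if __name__ == "__main__":
    sample = "તા. ૧૨/૦૩/૨૦૨૩ ના રોજ CrPC 438 અને કલમ ૪૩૯"
    print(extract_dates(sample))
    print(extract_sections(sample))
```

A baseline like this gives a floor against which the fine-tuned token classifiers can be compared on the structured entity types.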

Legal Summarization

Summarization models (mT5, IndicBART, mBART, ByT5, and instruction-tuned multilingual LLMs) convert long Gujarati court orders into concise structured summaries. Target summaries cover case background, legal section, key allegations, court reasoning, and final decision. Summaries can be produced in Gujarati or English depending on the downstream use case.

Retrieval and Legal Question Answering

Multilingual sentence embeddings (LaBSE, multilingual-e5, BGE-M3, XLM-R) index the cleaned corpus for semantic search. A retrieval-augmented generation system supports natural language queries such as:

Which cases mention anticipatory bail under CrPC 438?
Which court orders granted bail in a specific district and year?
What were the common reasons for bail rejection?
Which documents mention a specific legal provision or police station?
What are the facts and outcome of a given CNR number?
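The retrieval step behind such queries can be sketched as a cosine-similarity top-k search over precomputed embeddings. The example below is a simplified illustration in plain Python: the vectors are toy three-dimensional values standing in for real embeddings from a model such as LaBSE, and the document IDs and the `top_k` helper are hypothetical.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, index: dict, k: int = 3) -> list:
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors only; a real index would hold model-produced embeddings.
    index = {
        "doc_a": [1.0, 0.0, 0.0],
        "doc_b": [0.0, 1.0, 0.0],
        "doc_c": [0.9, 0.1, 0.0],
    }
    for doc_id, score in top_k([1.0, 0.0, 0.0], index, k=2):
        print(doc_id, round(score, 3))
```

In a RAG setup, the text of the top-ranked documents would then be passed to the generator as context, and metadata filters (district, year, CNR number) could be applied before or after scoring.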

LLM-Based Structured Information Extraction

Larger instruction-tuned models (LLaMA, Qwen, Gemma) extract structured JSON from noisy legal text, with fields such as case parties, court, judge, act, section, outcome, reasoning, and cited provisions. The project compares zero-shot prompting, few-shot prompting, LoRA fine-tuning, and OCR-corrected input pipelines, since general-purpose VLMs are weak on low-resource Gujarati legal text.
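Whichever model produces the output, the surrounding code has to build the prompt and validate the returned JSON. The sketch below shows one way to do that without calling any model. The field names, the prompt wording, and the helpers `build_prompt` and `parse_extraction` are assumptions for illustration; passing worked examples switches the prompt from zero-shot to few-shot.

```python
import json

# Hypothetical field names based on the fields listed in the text.
FIELDS = ("parties", "court", "judge", "act", "section",
          "outcome", "reasoning", "cited_provisions")


def build_prompt(text: str, examples: tuple = ()) -> str:
    """Build a zero-shot prompt, or few-shot if (input, json) pairs are given."""
    parts = ["Extract these fields as one JSON object: " + ", ".join(FIELDS) + "."]
    for example_text, example_json in examples:
        parts.append("Text:\n" + example_text + "\nJSON:\n" + example_json)
    parts.append("Text:\n" + text + "\nJSON:")
    return "\n\n".join(parts)


def parse_extraction(raw_output: str):
    """Pull the outermost JSON object from model output; None if unusable."""
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only expected fields; missing ones become None.
    return {name: data.get(name) for name in FIELDS}


if __name__ == "__main__":
    fake_output = 'Result: {"court": "example-court", "outcome": "granted"}'
    print(parse_extraction(fake_output))
```

Counting how often `parse_extraction` returns `None` or leaves fields empty gives a simple way to compare the prompting and fine-tuning variants on the same documents.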

ML Workflow

1. Scrape Gujarat eCourts metadata and PDFs.
2. Extract text using direct PDF parsing, OCR, or legacy font conversion.
3. Normalize Gujarati text and remove extraction noise.
4. Build JSONL records with metadata, text, labels, and extraction-quality tags.
5. Train baseline models using TF-IDF + Logistic Regression or SVM.
6. Fine-tune transformer models (IndicBERT, MuRIL, XLM-R, mT5, mBART).
7. Compare results across extraction methods to measure OCR impact on NLP performance.
8. Build retrieval indexes using multilingual embedding models.
9. Evaluate classification, extraction, summarization, and retrieval performance.
10. Deploy the best-performing pipeline for Gujarati legal analytics and court case understanding.
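The comparison across extraction methods in the workflow above can be sketched as grouping predictions by the extraction method that produced the input text and scoring each group separately. The example below uses macro-averaged F1 as the metric, which is an assumption, and the rows in the demo are toy values for illustration only, not project results.

```python
from collections import defaultdict


def macro_f1(gold: list, pred: list) -> float:
    """Macro-averaged F1 over every label seen in gold or pred."""
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_by_extraction_method(rows: list) -> dict:
    """rows: (extraction_method, gold_label, predicted_label) triples."""
    grouped = defaultdict(lambda: ([], []))
    for method, gold, pred in rows:
        grouped[method][0].append(gold)
        grouped[method][1].append(pred)
    return {method: macro_f1(g, p) for method, (g, p) in grouped.items()}


if __name__ == "__main__":
    # Toy rows only; these are not measurements from the project.
    rows = [
        ("direct_pdf", "granted", "granted"),
        ("direct_pdf", "rejected", "rejected"),
        ("tesseract", "granted", "rejected"),
        ("tesseract", "rejected", "rejected"),
    ]
    print(f1_by_extraction_method(rows))
```

Holding the model and test documents fixed while varying only the extraction method isolates the effect of OCR noise on the downstream score.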