# What You Need to Train Qwen and Mistral on Evaluations This document spells out **consent**, **export flow**, **data formats**, **scripts**, and **environment** so you can train both the text-based model (Mistral) and the video-based model (Qwen) on your SpeechGradebook evaluations. --- ## 1. Consent (required for export) The app only includes evaluations in LLM training exports when **both** of the following are satisfied: | Who | What | Where in the app / DB | |-----|------|------------------------| | **Student** | Data-use consent for the course | `consent_forms`: `consent_type = 'data_collection'`, `consent_given = true` (same consent used for grading, platform improvement, research). | | **Instructor** | LLM training consent | `user_profiles`: `llm_training_consent_given = true` (instructor opts in via Settings or consent prompt). | - Export filters out evaluations from courses where the student has not given data-use consent and from instructors who have not given LLM training consent. - Demo-tier accounts are excluded from training exports. --- ## 2. Export flow (Super Admin only) - **Where:** Analytics → **Evaluations** (or the **LLM Export** tab). Super Admin only. - **Main export:** Click **Download for LLM training** → saves `exported.json` with consented evaluations in the format expected by the training pipeline. **Export types:** | Export | File / use | Purpose | |--------|------------|--------| | **Download for LLM training** | `exported.json` | Single format for both Mistral (after conversion) and Qwen. Contains transcript, rubric, scores, markers, optional video_url. | | **Convert to JSONL** | Run `node export_to_jsonl.js exported.json [--split 0.9]` | Produces `train.jsonl` (and optional `validation.jsonl`) for **Mistral**. | | **Qwen single-eval manifest** | From same export or **Train Qwen on ISAAC** flow | Build `train_qwen.jsonl` (one line per evaluation: video_path/URL, rubric, scores). | | **Correction pairs** | **Export correction pairs (download JSONL)** | `train_qwen_correction_pairs.jsonl`: AI vs instructor (scores_original, scores_final). | | **Comparison pairs** | **Export comparison pairs (download JSONL)** | `train_qwen_pairs.jsonl`: same student, two speeches (video_path_1, video_path_2, rubric, scores). | See **COMPARISON_AND_CORRECTIONS_TRAINING.md** for when to use correction vs comparison pairs. --- ## 3. Data formats ### Mistral (text tier) - **Input to convert:** `exported.json` (from “Download for LLM training”). - **After convert:** `train.jsonl` (and optionally `validation.jsonl`). Each line is one JSON object with a `messages` array (system / user / assistant) for instruction tuning: - **User:** rubric + transcript (and optional video_notes). - **Assistant:** JSON scores/comments in the same shape as `evaluation_data.sections`. See `llm_training/README.md` and `example_train.jsonl` for the exact structure. ### Qwen (video tier) - **Manifest:** `train_qwen.jsonl`. One JSON object per line: - `video_path` or `image_path`: URL or path to video/image. - `rubric`: full rubric object (name, categories, subcategories). - `scores`: target scores in the same shape the model should output (e.g. `{"Content": {"score": 35, "maxScore": 40, "subcategories": [...]}, "Delivery": {...}}`). Optional exports for advanced training: - **Correction pairs:** `train_qwen_correction_pairs.jsonl` — `video_path`, `rubric`, `scores_original`, `scores_final` (and optional `evaluation_id`). Use when you want the model to learn from instructor overrides. - **Comparison pairs:** `train_qwen_pairs.jsonl` — `video_path_1`, `video_path_2`, `rubric`, scores for both. Use for “same student, two speeches” (e.g. Persuasive 1 vs 2). Your training script must support these formats; the in-repo `train_qwen_vl.py` currently expects the single-eval manifest format and validates it (see Scripts below). --- ## 4. Scripts | Script | Model | Purpose | |--------|--------|--------| | `export_to_jsonl.js` | Mistral | Converts `exported.json` → `train.jsonl` (and optional `validation.jsonl`). Run: `node export_to_jsonl.js exported.json [--split 0.9]`. | | `train_lora.py` | Mistral | LoRA fine-tuning on Mistral 7B. Needs `train.jsonl` (and optionally `validation.jsonl`). Run: `python train_lora.py --train_file train.jsonl [--validation_file validation.jsonl] --output_dir ./mistral7b-speech-lora`. Use `--load_in_8bit` for ~10GB VRAM. | | `train_qwen_vl.py` | Qwen | Validates `train_qwen.jsonl` manifest; `--validate_only` checks format and paths. A full training loop (Dataset + Qwen2.5-VL + LoRA) is not yet implemented in this repo—see **DUAL_MODEL_TRAINING.md** for options (LLaMA-Factory or extend this script). | | `train_speechgradebook.slurm` | Mistral | ISAAC SLURM job for Mistral training. | | `train_qwen_speechgradebook.slurm` | Qwen | ISAAC SLURM job for Qwen; runs `train_qwen_vl.py` when manifest exists (e.g. validate or future training step). | **Python dependencies:** Install once (e.g. in a venv or conda env on ISAAC): ```bash pip install -r llm_training/requirements-train.txt ``` You also need a Hugging Face token for Mistral: `export HF_TOKEN=your_token`. --- ## 5. Environment (where to run training) | Option | Notes | |--------|--------| | **Modal (serverless GPU)** | Pay-per-use A100 GPU (~$4-5/hour). No setup required, just deploy training script. Good for Qwen training. See **QWEN_TRAINING_MODAL.md** for complete guide. | | **ISAAC (campus HPC)** | Recommended: free GPU access (e.g. `campus-gpu`, V100/A40). Use your account (e.g. `acf-utk0011`) and `--qos=campus-gpu`. Copy `llm_training/` and your export/manifest files, then `sbatch train_speechgradebook.slurm` or `train_qwen_speechgradebook.slurm`. See **ISAAC_SETUP.md**, **ISAAC_QUICK_FIXES.md**. | | **RunPod / Lambda Labs / other cloud GPU** | Same scripts; ensure enough VRAM (Mistral ~10–16 GB with 8-bit; Qwen-VL typically ≥24 GB). | | **Local machine** | Only if you have a suitable GPU; same commands, often with `--load_in_8bit` for Mistral. | --- ## Quick reference: Mistral vs Qwen | Step | Mistral (text) | Qwen (video) | |------|-----------------|--------------| | 1. Consent | Student `data_collection` + instructor `llm_training_consent_given` | Same | | 2. Export | Super Admin → **Download for LLM training** → `exported.json` | Same export; use video_url / paths for Qwen manifest | | 3. Convert | `node export_to_jsonl.js exported.json [--split 0.9]` → `train.jsonl` | Build `train_qwen.jsonl` (or use correction/comparison exports) | | 4. Train | `python train_lora.py --train_file train.jsonl --output_dir ./mistral7b-speech-lora` | `python train_qwen_vl.py --manifest train_qwen.jsonl --validate_only` (validation only; full training via LLaMA-Factory or extended script) | | 5. Serve | `serve_model.py` for text-tier API | Qwen service (e.g. `qwen_serve.py`) when you want video-tier evaluations | For more detail: **README.md**, **DUAL_MODEL_TRAINING.md**, **STEPS_TO_REAL_EVALUATIONS.md**, **COMPARISON_AND_CORRECTIONS_TRAINING.md**.