# Voice-Controlled Robot Arm

A full-stack embodied AI system — voice in, physical action out — running entirely offline on consumer hardware.

[中文](README.md)

---

## Overview

| Layer | Implementation |
|:---|:---|
| **Hear** | Faster-Whisper, local Chinese speech recognition |
| **Think** | DeepSeek-R1-1.5B + QLoRA fine-tune, natural language → JSON |
| **See** | YOLOv8s object detection + homography hand-eye calibration |
| **Move** | D-H inverse kinematics + S-Curve trajectory, ESP32 PWM |

Total hardware cost **¥317 (~$45 USD)**. Requires an NVIDIA GPU for LLM inference (RTX 3060 6GB recommended, <4GB VRAM at runtime, <200ms latency).

---

## Architecture

```mermaid
flowchart TD
    MIC["🎤 Microphone"] --> STT["Faster-Whisper<br/>Chinese speech recognition"]
    STT --> RULE{"Regex engine<br/>Simple command match"}
    RULE -- "Hit" --> ACT["JSON action"]
    RULE -- "Miss (has object name)" --> LLM["DeepSeek-R1-1.5B<br/>QLoRA FP16<br/>Natural language → JSON"]
    LLM --> ACT
    ACT --> VIS["YOLOv8s + Homography<br/>Object detection · hand-eye calibration<br/>Pixel coords → robot coords mm"]
    VIS --> MOT["arm_main.py<br/>D-H IK + S-Curve"]
    MOT --> ESP["ESP32 PWM → Servos"]
```

---

## Bill of Materials

Total: **¥317 (~$45 USD)**

| # | Item | Spec | Qty | Unit | Total |
|:--|:---|:---|:--:|---:|---:|
| 1 | 3D-printed robot arm kit | Acrylic/PLA structural parts | 1 | ¥71 | ¥71 |
| 2 | ESP32 dev board | Dual-core MCU, WiFi + BT | 1 | ¥19 | ¥19 |
| 3 | ESP32 accessories | Connectors / expansion board | 1 | ¥5 | ¥5 |
| 4 | USB industrial camera | Plug-and-play, wide-angle, 1280×720 | 1 | ¥61 | ¥61 |
| 5 | Digital servo MG996R | Metal gear, high torque | 5 | ¥27 | ¥133 |
| 6 | Regulated power supply | 6V 6A, servo-dedicated | 1 | ¥29 | ¥29 |

**Wiring**

- **ESP32 pins**: X→14, Y→4, Z→5, B→18, Gripper→23
- **Power**: servos and ESP32 on separate supplies (external 6V/6A) to prevent inrush surge
- **Camera**: USB, mounted in front of the arm covering the full work surface
- **Serial**: USB to ESP32, default port `COM3`, override with `ROBOT_PORT` env var

---

## Installation

### 1. Flash Firmware

Arduino IDE 2.x, board: "ESP32 Dev Module". Open `main.ino`, select the correct port, click Upload.

### 2. Python Environment

Python 3.10+, CUDA 11.8 or 12.x.

```bash
# Install the correct CUDA build of PyTorch from pytorch.org first, then:
pip install -r requirements.txt
```

### 3. Configure

All tunables are in `config.py` and support environment variable overrides — no code changes needed:

```bash
ROBOT_PORT=COM5               python voice_main.py  # change serial port
LLM_MODEL_PATH=D:\models\lora python voice_main.py  # change LLM path
YOLO_MODEL_PATH=runs/best.pt  python voice_main.py  # change YOLO path
```

### 4. Models

**Speech (Whisper)**: the `base` model is downloaded automatically on first run.

**Vision (YOLO)**: train your own detector — 50 labelled images is enough for transfer learning:

```bash
yolo detect train model=yolov8s.pt data=data.yaml epochs=100 imgsz=640
# Output: runs/detect/train/weights/best.pt → copy to project root
```

**LLM**: fine-tune DeepSeek-R1-1.5B or Qwen1.5-1.8B with QLoRA. See [`TRAINING.md`](TRAINING.md) for the complete guide.

Training data format (Alpaca):
```json
{
  "instruction": "lift the pencil sharpener 5cm",
  "input": "",
  "system": "You are a robot arm JSON converter...",
  "output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]"
}
```

---

## Quick Start

```bash
python voice_main.py
```

On startup the system loads in order: serial port → YOLO → Whisper → LLM → camera window.

**Keyboard Shortcuts**

| Key | Function |
|:---|:---|
| **SPACE (hold)** | Record audio; release to transcribe and execute |
| **C** | Toggle hand-eye calibration mode |
| **R** | Manual reset to home position |
| **O** | Force open gripper |
| **Q** | Quit |

---

## Voice Commands

Speak natural Chinese. No special syntax required.

**Pick and transport (requires visual detection)**
```
"把削笔刀抓起来"   — pick up the pencil sharpener
"抓住那个盒子"     — grab that box
"把削笔刀抬起5厘米" — lift the pencil sharpener 5cm
"将零件举高10公分"  — raise the part 10cm
```

**Precise directional movement**
```
"向上三厘米"      → Z +30mm
"向左移动四毫米"   → Y +4mm
"往前伸10厘米"    → X +100mm
```

**Fuzzy movement** (no explicit distance, defaults to 5cm per `config.DEFAULT_MOVE_MM`)
```
"向左"  "抬起"  "往下"
```

**Gestures and state commands**
```
"点头"  — nod: oscillate Z ×3 (±3cm)
"摇头"  — shake head: oscillate Y ×3 (±3cm)
"放下"  — lower to table height (Z=-15mm) and release
"复位"  — return to home position [120, 0, 60] mm
"松开"  — open gripper without moving
```

**Speech compatibility**: built-in homophone correction for common Whisper mishearings, e.g. `"零米"→"厘米"`, `"小笔刀"→"削笔刀"`, `"电头"→"点头"`.

---

## Hand-Eye Calibration

Recalibrate whenever the camera is moved. Press **C** to enter calibration mode, then click 4 corner points in order:

```
P1 (top-left)     ↔  robot coords (90,  90)
P2 (top-right)    ↔  robot coords (200,  90)
P3 (bottom-right) ↔  robot coords (200, -90)
P4 (bottom-left)  ↔  robot coords (90, -90)
```

The homography matrix updates instantly after the 4th click. No restart needed.

---

## Troubleshooting

| Symptom | Cause | Fix |
|:---|:---|:---|
| SPACE does nothing | Camera window not focused | Click the camera window first |
| Garbled recognition | Mic noise / speaking too fast | Quiet environment, moderate pace; hold SPACE 0.5s before speaking |
| "Target not found" | YOLO didn't detect the object | Adjust lighting/angle; verify object is in training classes |
| Pick position offset | Camera was moved | Press **C** and redo 4-point calibration |
| Serial connection failed | ESP32 not plugged in / wrong port | Check device manager; set `ROBOT_PORT` env var |
| Violent shaking on startup | 5-servo simultaneous inrush | Firmware staggers power-on; if it persists, check PSU capacity |

---

## Technical Notes

Key engineering problems solved during development.

**D-H Inverse Kinematics**
The 130mm L4 link causes ~40° path deviation with geometric IK during horizontal moves. Solved by Scipy SLSQP numerical optimization with a `Pitch=-90°` constraint (end-effector always perpendicular to the table), eliminating the nonlinear offset entirely.

**S-Curve + Multi-Layer Damping**
MG996R servos vibrate badly under a long lever arm. Five-layer damping pipeline: tilt correction → moving-average filter (deque) → speed cap → EMA damping → dead-zone filter.

**Dual-Channel Parse Architecture**
Simple commands (release/reset/directional moves) bypass the LLM entirely via a regex engine (microseconds). Only complex commands containing object names reach the LLM (<200ms). This prevents the common failure mode where "move down 3cm" gets misclassified as a `lift` action.

**Pre-filling to Skip Chain-of-Thought**
DeepSeek-R1 outputs a `<think>...</think>` chain-of-thought by default. Appending `<｜Assistant｜>` as a pre-fill token forces the model to skip the thinking phase and emit JSON directly, achieving 100% format compliance.

**Whisper Anti-Hallucination**
Three defences, all encapsulated in `RobotEar.get_text()`: silence trimming + duration guards; `condition_on_previous_text=False`; repeated-phrase regex dedup (removes "向右向右向右..." loops). All thresholds are tunable via `config.py`.

**Engineering Pitfall: System Prompt Alignment**
The system prompt at inference must exactly match the one used during fine-tuning. Any mismatch causes output drift (e.g., outputting 500mm instead of 50mm). A warning comment is included in the source.

---

## LLM Training

~500 domain-specific samples, QLoRA fine-tune of DeepSeek-R1-1.5B, loss converged to 0.0519, format error rate 0%.

See [`TRAINING.md`](TRAINING.md) for the full guide: QLoRA hyperparameter config, GGUF vs Transformers comparison, pre-filling inference details, and experiment results.

---

## Project Structure

```
robot_arm/
├── README.md          Chinese documentation
├── README_EN.md       This file
├── TRAINING.md        LLM LoRA fine-tuning research notes
├── requirements.txt   Python dependencies
├── config.py          All tunables: hardware, motion, audio & gesture constants
│
├── main.ino           ESP32 firmware, LEDC PWM servo control
├── arm_main.py        Kinematics core: D-H IK + S-Curve trajectory
├── whisper_main.py    Full ASR pipeline: silence trim → transcribe → post-process
└── voice_main.py      Main app: voice → LLM → vision → motion
```

---

## Key Specs

| Metric | Value |
|:---|:---|
| Hardware cost | ¥317 (~$45 USD) |
| GPU requirement | RTX 3060 6GB (<4GB VRAM at runtime) |
| Inference latency | <200ms (LLM), <50ms (rule engine) |
| Training samples | ~500 |
| Format error rate | 0% |
| Operation mode | Fully offline |