# Voice-Controlled Robot Arm
A full-stack embodied AI system — voice in, physical action out — running entirely offline on consumer hardware.
[中文](README.md)
---
## Overview
| Layer | Implementation |
|:---|:---|
| **Hear** | Faster-Whisper, local Chinese speech recognition |
| **Think** | DeepSeek-R1-1.5B + QLoRA fine-tune, natural language → JSON |
| **See** | YOLOv8s object detection + homography hand-eye calibration |
| **Move** | D-H inverse kinematics + S-Curve trajectory, ESP32 PWM |
Total hardware cost **¥317 (~$45 USD)**. Requires an NVIDIA GPU for LLM inference (RTX 3060 6GB recommended, <4GB VRAM at runtime, <200ms latency).
---
## Architecture
```mermaid
flowchart TD
MIC["🎤 Microphone"] --> STT["Faster-Whisper<br/>Chinese speech recognition"]
STT --> RULE{"Regex engine<br/>Simple command match"}
RULE -- "Hit" --> ACT["JSON action"]
RULE -- "Miss (has object name)" --> LLM["DeepSeek-R1-1.5B<br/>QLoRA FP16<br/>Natural language → JSON"]
LLM --> ACT
ACT --> VIS["YOLOv8s + Homography<br/>Object detection · hand-eye calibration<br/>Pixel coords → robot coords mm"]
VIS --> MOT["arm_main.py<br/>D-H IK + S-Curve"]
MOT --> ESP["ESP32 PWM → Servos"]
```
---
## Bill of Materials
Total: **¥317 (~$45 USD)**
| # | Item | Spec | Qty | Unit | Total |
|:--|:---|:---|:--:|---:|---:|
| 1 | 3D-printed robot arm kit | Acrylic/PLA structural parts | 1 | ¥71 | ¥71 |
| 2 | ESP32 dev board | Dual-core MCU, WiFi + BT | 1 | ¥19 | ¥19 |
| 3 | ESP32 accessories | Connectors / expansion board | 1 | ¥5 | ¥5 |
| 4 | USB industrial camera | Plug-and-play, wide-angle, 1280×720 | 1 | ¥61 | ¥61 |
| 5 | Digital servo MG996R | Metal gear, high torque | 5 | ¥27 | ¥133 |
| 6 | Regulated power supply | 6V 6A, servo-dedicated | 1 | ¥29 | ¥29 |
**Wiring**
- **ESP32 pins**: X=14, Y=4, Z=5, B=18, Gripper=23
- **Power**: servos and ESP32 on separate supplies (external 6V/6A) to prevent inrush surge
- **Camera**: USB, mounted in front of the arm covering the full work surface
- **Serial**: USB to ESP32, default port `COM3`, override with `ROBOT_PORT` env var
---
## Installation
### 1. Flash Firmware
Arduino IDE 2.x, board: "ESP32 Dev Module". Open `main.ino`, select the correct port, click Upload.
### 2. Python Environment
Python 3.10+, CUDA 11.8 or 12.x.
```bash
# Install the correct CUDA build of PyTorch from pytorch.org first, then:
pip install -r requirements.txt
```
### 3. Configure
All tunables live in `config.py` and can be overridden via environment variables; no code changes needed:
```bash
ROBOT_PORT=COM5 python voice_main.py # change serial port
LLM_MODEL_PATH=D:\models\lora python voice_main.py # change LLM path
YOLO_MODEL_PATH=runs/best.pt python voice_main.py # change YOLO path
```
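Inside `config.py`, this pattern can be as simple as reading each tunable from the environment with a hard-coded fallback. A minimal sketch (variable names mirror the overrides above; the defaults are illustrative assumptions, not the project's actual values):

```python
import os

# Each tunable checks its environment variable first, then falls back
# to a default. Overriding a setting is then just ROBOT_PORT=COM5 python ...
ROBOT_PORT = os.environ.get("ROBOT_PORT", "COM3")
LLM_MODEL_PATH = os.environ.get("LLM_MODEL_PATH", "models/deepseek-lora")
YOLO_MODEL_PATH = os.environ.get("YOLO_MODEL_PATH", "best.pt")
DEFAULT_MOVE_MM = int(os.environ.get("DEFAULT_MOVE_MM", "50"))
```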
### 4. Models
**Speech (Whisper)**: the `base` model is downloaded automatically on first run.
**Vision (YOLO)**: train your own detector; ~50 labelled images is enough for transfer learning:
```bash
yolo detect train model=yolov8s.pt data=data.yaml epochs=100 imgsz=640
# Output: runs/detect/train/weights/best.pt → copy to project root
```
**LLM**: fine-tune DeepSeek-R1-1.5B or Qwen1.5-1.8B with QLoRA. See [`TRAINING.md`](TRAINING.md) for the complete guide.
Training data format (Alpaca):
```json
{
"instruction": "lift the pencil sharpener 5cm",
"input": "",
"system": "You are a robot arm JSON converter...",
"output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]"
}
```
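A sanity check worth running over the whole dataset: the `output` field must itself be valid JSON, or the fine-tune learns malformed targets. A minimal sketch using the sample above:

```python
import json

# One Alpaca-format training sample; "output" is a JSON string
# encoding the action list the model should emit.
sample = {
    "instruction": "lift the pencil sharpener 5cm",
    "input": "",
    "system": "You are a robot arm JSON converter...",
    "output": "[{\"action\": \"lift\", \"target\": \"part\", \"height\": 50}]",
}

# Parse the target: if this raises, the sample would teach bad JSON.
actions = json.loads(sample["output"])
```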
---
## Quick Start
```bash
python voice_main.py
```
On startup the system loads in order: serial port → YOLO → Whisper → LLM → camera window.
**Keyboard Shortcuts**
| Key | Function |
|:---|:---|
| **SPACE (hold)** | Record audio; release to transcribe and execute |
| **C** | Toggle hand-eye calibration mode |
| **R** | Manual reset to home position |
| **O** | Force open gripper |
| **Q** | Quit |
---
## Voice Commands
Speak natural Chinese. No special syntax required.
**Pick and transport (requires visual detection)**
```
"把削笔刀抓起来" — pick up the pencil sharpener
"抓住那个盒子" — grab that box
"把削笔刀抬起5厘米" — lift the pencil sharpener 5cm
"将零件举高10公分" — raise the part 10cm
```
**Precise directional movement**
```
"向上三厘米" (up three centimeters) → Z +30mm
"向左移动四毫米" (move left four millimeters) → Y +4mm
"往前伸10厘米" (extend forward 10cm) → X +100mm
```
**Fuzzy movement** (no explicit distance, defaults to 5cm per `config.DEFAULT_MOVE_MM`)
```
"向左" (left)   "抬起" (lift)   "往下" (down)
```
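The direction/distance handling above can be sketched as follows. This is an illustration, not the project's actual parser: the direction table, digit table, and unit handling are simplified assumptions (multi-character numerals like 十五 are not covered):

```python
import re

# Map Chinese direction words to an axis and sign; left/right are Y,
# up/down are Z, forward/back are X, matching the examples above.
DIRECTIONS = {"上": ("z", +1), "下": ("z", -1), "左": ("y", +1),
              "右": ("y", -1), "前": ("x", +1), "后": ("x", -1)}
CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}
DEFAULT_MOVE_MM = 50  # fuzzy commands fall back to this (config.DEFAULT_MOVE_MM)

def parse_move(text: str):
    """Return (axis, signed mm) for a directional command, else None."""
    for word, (axis, sign) in DIRECTIONS.items():
        if word in text:
            m = re.search(r"([一二三四五六七八九十\d]+)\s*(厘米|公分|毫米)", text)
            if m:
                num = m.group(1)
                value = int(num) if num.isdigit() else CN_DIGITS.get(num, 0)
                # 厘米/公分 are centimeters (x10 to mm); 毫米 is already mm.
                mm = value * 10 if m.group(2) in ("厘米", "公分") else value
            else:
                mm = DEFAULT_MOVE_MM
            return axis, sign * mm
    return None
```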
**Gestures and state commands**
```
"点头" — nod: oscillate Z ×3 (±3cm)
"摇头" — shake head: oscillate Y ×3 (±3cm)
"放下" — lower to table height (Z=-15mm) and release
"复位" — return to home position [120, 0, 60] mm
"松开" — open gripper without moving
```
**Speech compatibility**: built-in homophone correction for common Whisper mishearings, e.g. `"零米"→"厘米"`, `"小笔刀"→"削笔刀"`, `"电头"→"点头"`.
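The homophone correction is essentially a substitution table applied to the transcript before parsing. A minimal sketch built from the examples above (the project's real table is presumably longer):

```python
# Common Whisper mishearings -> intended phrases; applied before parsing.
HOMOPHONES = {
    "零米": "厘米",      # mis-heard "centimeter"
    "小笔刀": "削笔刀",  # mis-heard "pencil sharpener"
    "电头": "点头",      # mis-heard "nod"
}

def correct(text: str) -> str:
    for wrong, right in HOMOPHONES.items():
        text = text.replace(wrong, right)
    return text
```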
---
## Hand-Eye Calibration
Recalibrate whenever the camera is moved. Press **C** to enter calibration mode, then click 4 corner points in order:
```
P1 (top-left) ↔ robot coords (90, 90)
P2 (top-right) ↔ robot coords (200, 90)
P3 (bottom-right) ↔ robot coords (200, -90)
P4 (bottom-left) ↔ robot coords (90, -90)
```
The homography matrix updates instantly after the 4th click. No restart needed.
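The 4-point fit can be sketched in plain NumPy (the project likely uses OpenCV's `getPerspectiveTransform`; the pixel coordinates below are illustrative clicks, while the robot coordinates are the calibration rectangle above):

```python
import numpy as np

def fit_homography(px, rb):
    """Solve the 3x3 homography H mapping pixel (u,v) to robot (x,y),
    i.e. H @ [u, v, 1] ~ [x, y, 1] up to scale, from four point pairs."""
    A, b = [], []
    for (u, v), (x, y) in zip(px, rb):
        A.append([u, v, 1, 0, 0, 0, -u * x, -v * x]); b.append(x)
        A.append([0, 0, 0, u, v, 1, -u * y, -v * y]); b.append(y)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def pixel_to_robot(H, u, v):
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w  # perspective divide

pixels = [(100, 80), (540, 80), (540, 400), (100, 400)]  # illustrative clicks
robot  = [(90, 90), (200, 90), (200, -90), (90, -90)]    # the rectangle above
H = fit_homography(pixels, robot)
```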
---
## Troubleshooting
| Symptom | Cause | Fix |
|:---|:---|:---|
| SPACE does nothing | Camera window not focused | Click the camera window first |
| Garbled recognition | Mic noise / speaking too fast | Quiet environment, moderate pace; hold SPACE 0.5s before speaking |
| "Target not found" | YOLO didn't detect the object | Adjust lighting/angle; verify object is in training classes |
| Pick position offset | Camera was moved | Press **C** and redo 4-point calibration |
| Serial connection failed | ESP32 not plugged in / wrong port | Check device manager; set `ROBOT_PORT` env var |
| Violent shaking on startup | 5-servo simultaneous inrush | Firmware staggers power-on; if it persists, check PSU capacity |
---
## Technical Notes
Key engineering problems solved during development.
**D-H Inverse Kinematics**
The 130mm L4 link causes ~40° path deviation with geometric IK during horizontal moves. Solved with SciPy SLSQP numerical optimization under a `Pitch=-90°` constraint (end-effector always perpendicular to the table), eliminating the nonlinear offset entirely.
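A toy version of the approach, assuming a planar 3-link chain with illustrative link lengths rather than the project's full D-H parameters: SLSQP minimizes position error while an equality constraint pins the pitch (sum of joint angles) to -90°.

```python
import numpy as np
from scipy.optimize import minimize

L = [104.0, 89.0, 130.0]  # illustrative link lengths, mm

def fk(q):
    """Forward kinematics of a planar chain in the X-Z plane."""
    x = z = a = 0.0
    for qi, li in zip(q, L):
        a += qi
        x += li * np.cos(a)
        z += li * np.sin(a)
    return np.array([x, z])

def ik(target, q0=(1.0, -0.8, -1.8)):
    """Numerical IK: minimize squared position error subject to
    pitch = sum(q) = -90 deg (tool perpendicular to the table)."""
    res = minimize(
        lambda q: float(np.sum((fk(q) - target) ** 2)),
        q0, method="SLSQP",
        constraints=[{"type": "eq",
                      "fun": lambda q: np.sum(q) + np.pi / 2}],
    )
    return res.x
```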
**S-Curve + Multi-Layer Damping**
MG996R servos vibrate badly under a long lever arm. Five-layer damping pipeline: tilt correction → moving-average filter (deque) → speed cap → EMA damping → dead-zone filter.
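One way the latter four stages could look for a single servo channel (window size, step cap, smoothing factor, and dead-zone threshold are illustrative, not the project's tuned values):

```python
from collections import deque

class DampedChannel:
    def __init__(self, window=5, max_step=3.0, alpha=0.3, dead_zone=0.5):
        self.hist = deque(maxlen=window)  # moving-average window
        self.max_step = max_step          # max degrees per update
        self.alpha = alpha                # EMA smoothing factor
        self.dead_zone = dead_zone        # ignore tiny changes
        self.out = 0.0

    def update(self, target_deg: float) -> float:
        self.hist.append(target_deg)
        avg = sum(self.hist) / len(self.hist)          # moving-average filter
        step = max(-self.max_step,                     # speed cap
                   min(self.max_step, avg - self.out))
        smoothed = self.out + self.alpha * step        # EMA damping
        if abs(smoothed - self.out) < self.dead_zone:  # dead-zone filter
            return self.out
        self.out = smoothed
        return self.out
```

Each stage attacks a different vibration source: the average rejects jitter in the commanded target, the cap and EMA bound acceleration, and the dead-zone stops micro-oscillation at rest.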
**Dual-Channel Parse Architecture**
Simple commands (release/reset/directional moves) bypass the LLM entirely via a regex engine (microseconds). Only complex commands containing object names reach the LLM (<200ms). This prevents the common failure mode where "move down 3cm" gets misclassified as a `lift` action.
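A stripped-down sketch of the dispatch (patterns and JSON fields are assumptions; the real engine handles many more forms):

```python
import re

# Fast path: simple commands resolve by regex, never touching the LLM.
FAST_PATTERNS = [
    (re.compile(r"复位"), [{"action": "reset"}]),
    (re.compile(r"松开"), [{"action": "release"}]),
    (re.compile(r"向([上下左右])"), None),  # directional: built from the match
]

def parse(text, llm=None):
    """Return (actions, channel). channel is 'rule' or 'llm'."""
    for pattern, action in FAST_PATTERNS:
        m = pattern.search(text)
        if m:
            if action is not None:
                return action, "rule"
            return [{"action": "move", "dir": m.group(1)}], "rule"
    # Only complex commands (those naming an object) reach the LLM.
    return (llm(text) if llm else None), "llm"
```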
**Pre-filling to Skip Chain-of-Thought**
DeepSeek-R1 outputs a `<think>...</think>` chain-of-thought by default. Appending `<Assistant>` as a pre-fill token forces the model to skip the thinking phase and emit JSON directly, achieving 100% format compliance.
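At the string level the trick looks like this (the tag names are illustrative; match them to your checkpoint's actual chat template), paired with a defensive cleanup for any `<think>` block that slips through anyway:

```python
import re

def build_prompt(system: str, user: str) -> str:
    # Ending the prompt with the assistant tag means decoding resumes
    # *after* the point where the model would open its <think> block,
    # so it emits the JSON answer directly.
    return f"<|system|>{system}<|user|>{user}<Assistant>"

def clean_output(raw: str) -> str:
    # Belt and braces: strip any chain-of-thought block before json.loads.
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
```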
**Whisper Anti-Hallucination**
Three defences, all encapsulated in `RobotEar.get_text()`: silence trimming + duration guards; `condition_on_previous_text=False`; repeated-phrase regex dedup (removes "向右向右向右..." loops). All thresholds are tunable via `config.py`.
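The repeated-phrase dedup can be sketched with a single backreference regex (an assumption about the implementation; the real thresholds live in `config.py`): any phrase of up to six characters repeated three or more times in a row collapses to one occurrence.

```python
import re

def dedup(text: str) -> str:
    # (.{1,6}?) lazily captures a short phrase; \1{2,} requires at
    # least two more consecutive copies; the whole run is replaced
    # by a single copy of the phrase.
    return re.sub(r"(.{1,6}?)\1{2,}", r"\1", text)
```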
**Engineering Pitfall: System Prompt Alignment**
The system prompt at inference must exactly match the one used during fine-tuning. Any mismatch causes output drift (e.g., outputting 500mm instead of 50mm). A warning comment is included in the source.
---
## LLM Training
~500 domain-specific samples, QLoRA fine-tune of DeepSeek-R1-1.5B, loss converged to 0.0519, format error rate 0%.
See [`TRAINING.md`](TRAINING.md) for the full guide: QLoRA hyperparameter config, GGUF vs Transformers comparison, pre-filling inference details, and experiment results.
---
## Project Structure
```
robot_arm/
├── README.md          Chinese documentation
├── README_EN.md       This file
├── TRAINING.md        LLM LoRA fine-tuning research notes
├── requirements.txt   Python dependencies
├── config.py          All tunables: hardware, motion, audio & gesture constants
├── main.ino           ESP32 firmware, LEDC PWM servo control
├── arm_main.py        Kinematics core: D-H IK + S-Curve trajectory
├── whisper_main.py    Full ASR pipeline: silence trim → transcribe → post-process
└── voice_main.py      Main app: voice → LLM → vision → motion
```
---
## Key Specs
| Metric | Value |
|:---|:---|
| Hardware cost | ¥317 (~$45 USD) |
| GPU requirement | RTX 3060 6GB (<4GB VRAM at runtime) |
| Inference latency | <200ms (LLM), <50ms (rule engine) |
| Training samples | ~500 |
| Format error rate | 0% |
| Operation mode | Fully offline |