automa/docs/IMPLEMENTATION.md
m1ngsama 49a2621f2f docs: add comprehensive documentation and architecture guides
2026-01-19 16:31:24 +08:00


# Automa Implementation Guide
## Quick Start
### Phase 1: Core Infrastructure (Week 1)
#### 1. Add Caddy (Reverse Proxy + SSL)
**Why Caddy?**
- Auto HTTPS (Let's Encrypt)
- Simple config (3-5 lines)
- Low memory (~30MB)
```yaml
# infrastructure/caddy/compose.yml
services:
  caddy:
    image: caddy:2-alpine
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    networks:
      - proxy
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

volumes:
  caddy_data:
  caddy_config:

networks:
  proxy:
    name: automa-proxy
    external: true
```
**Caddyfile:**
```caddyfile
# Global options
{
    email your@email.com
}

# Nextcloud
cloud.example.com {
    reverse_proxy nextcloud:80
    encode gzip
}

# Grafana
grafana.example.com {
    reverse_proxy grafana:3000
}
```
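Before reloading the proxy, the Caddyfile can be checked for syntax errors with the same image the stack runs. The following is a minimal sketch: it assumes Docker is installed, and the function name `validate_caddyfile` is illustrative rather than part of the repo.

```bash
# Validate a Caddyfile without starting the proxy.
# Assumption: the caddy:2-alpine image from the compose file above is used.
validate_caddyfile() {
  local file="${1:-./Caddyfile}"
  [ -f "$file" ] || { echo "No such file: $file" >&2; return 1; }
  # Resolve to an absolute path because Docker bind mounts require one.
  local abs
  abs="$(cd "$(dirname "$file")" && pwd)/$(basename "$file")"
  docker run --rm -v "$abs:/etc/caddy/Caddyfile:ro" caddy:2-alpine \
    caddy validate --config /etc/caddy/Caddyfile --adapter caddyfile
}
```

A typical call would be `validate_caddyfile infrastructure/caddy/Caddyfile`, followed by `docker exec caddy caddy reload --config /etc/caddy/Caddyfile` once validation passes.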
---
#### 2. Add Monitoring Stack
**Stack: Prometheus + Grafana + Loki (lightweight), with Promtail for log shipping and cAdvisor for container metrics**
```yaml
# infrastructure/monitoring/compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.48.1 # Prometheus publishes no -alpine tags
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.2.3 # Default Grafana image is already Alpine-based
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_ANALYTICS_REPORTING_ENABLED=false
    networks:
      - monitoring
      - proxy

  loki:
    image: grafana/loki:2.9.3
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:2.9.3
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/promtail-config.yml
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    privileged: true
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  loki_data:

networks:
  monitoring:
    name: automa-monitoring
    external: true # Created by `make network-create`
  proxy:
    name: automa-proxy
    external: true
```
**Minimal Prometheus Config:**
```yaml
# prometheus.yml
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  # Nextcloud does not expose Prometheus metrics itself; this target
  # needs a metrics exporter reachable on the monitoring network.
  - job_name: 'nextcloud'
    static_configs:
      - targets: ['nextcloud:80']
```
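Once the stack is running, the scrape configuration can be verified from the host. This is a sketch under a few assumptions: the container is named `prometheus` as in the compose file above, `curl` is installed on the host, and the helper names are made up for illustration. `promtool` ships inside the `prom/prometheus` image, and `/api/v1/targets` is the Prometheus HTTP API endpoint that reports target health.

```bash
# Check the config file for errors with promtool inside the running container.
check_prometheus_config() {
  docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
}

# Count scrape targets by health state (up, down, unknown).
# Port 9090 is published on 127.0.0.1 in the compose file above.
target_health() {
  curl -fsS "http://127.0.0.1:9090/api/v1/targets" |
    grep -oE '"health":[[:space:]]*"[a-z]+"' |
    sed -E 's/.*"([a-z]+)"$/\1/' |
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

`target_health` prints one `state=count` line per health state, which makes a `down` target easy to spot after editing `prometheus.yml`.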
---
#### 3. Add Watchtower (Auto Update)
```yaml
# infrastructure/watchtower/compose.yml
services:
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    environment:
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_POLL_INTERVAL=86400 # 24h
      - WATCHTOWER_LABEL_ENABLE=true # Only update labeled containers
      - TZ=Asia/Shanghai
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    labels:
      - "com.centurylinklabs.watchtower.enable=false" # Don't update itself
```
**Add label to services you want to auto-update:**
```yaml
services:
  nextcloud:
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
```
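To confirm which running containers carry the opt-in label, `docker ps` can filter on it. A small sketch; the function name `watchtower_targets` is illustrative.

```bash
# List running containers that Watchtower is allowed to update.
watchtower_targets() {
  docker ps \
    --filter "label=com.centurylinklabs.watchtower.enable=true" \
    --format '{{.Names}} ({{.Image}})'
}
```

Any service missing from this list will be skipped, because `WATCHTOWER_LABEL_ENABLE=true` restricts updates to labeled containers.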
---
#### 4. Fix Image Versions
**Before (bad):**
```yaml
image: nextcloud:latest
```
**After (good):**
```yaml
image: nextcloud:28-apache # Pin major version
```
**Update all compose files:**
```yaml
# Minecraft
image: itzg/minecraft-server:java21
# TeamSpeak
image: teamspeak:latest # TS doesn't follow semver
# Nextcloud
image: nextcloud:28-apache
image: mariadb:11.2-jammy
image: redis:7-alpine
```
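To find image references that still float on `latest` or carry no tag at all, the compose files can be scanned with a short helper. This is a sketch: the name `find_unpinned_images` is illustrative, and it assumes every compose file is called `compose.yml`, as in this guide.

```bash
# Print image references that use :latest or have no tag.
# Assumption: compose files are named compose.yml.
find_unpinned_images() {
  local root="${1:-.}"
  find "$root" -type f -name 'compose.yml' \
      -exec grep -hE '^[[:space:]]*image:' {} + |
    sed -E 's/^[[:space:]]*image:[[:space:]]*//; s/[[:space:]]*#.*$//' |
    awk '{
      name = $0
      sub(/^.*\//, "", name)   # drop registry/namespace so a registry port is not read as a tag
      if (name !~ /:/ || name ~ /:latest$/) print $0
    }' | sort -u
}
```

Running `find_unpinned_images .` from the repo root lists each offending reference once. Images that are left on `latest` on purpose (TeamSpeak above) will still show up, so treat the output as a review list rather than a failure.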
---
### Phase 2: Backup Enhancement (Week 2)
#### 5. Add Duplicati (Remote Backup)
```yaml
# infrastructure/duplicati/compose.yml
services:
  duplicati:
    image: lscr.io/linuxserver/duplicati:latest
    container_name: duplicati
    restart: unless-stopped
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Asia/Shanghai
    volumes:
      - ./config:/config
      - ../backups:/source:ro # Read-only access to local backups
    ports:
      - "127.0.0.1:8200:8200"
```
**Setup in Web UI (http://localhost:8200):**
1. Add backup job
2. Source: `/source` (local backups)
3. Destination: S3/SFTP/WebDAV/etc
4. Schedule: Daily at 3 AM
5. Retention: Keep 30 days
---
### Phase 3: Security (Week 3)
#### 6. Add Fail2ban
```yaml
# infrastructure/fail2ban/compose.yml
services:
  fail2ban:
    image: crazymax/fail2ban:latest
    container_name: fail2ban
    restart: unless-stopped
    network_mode: host
    cap_add:
      - NET_ADMIN
      - NET_RAW
    volumes:
      - ./data:/data
      - /var/log:/var/log:ro
    environment:
      - TZ=Asia/Shanghai
```
**Minimal jail.d/defaults.conf:**
```ini
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 5
[sshd]
enabled = true
port = ssh
logpath = /var/log/auth.log
```
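The three `[DEFAULT]` values work together: an address is banned for `bantime` seconds once it accumulates `maxretry` failures within `findtime` seconds. The helper below is a simplified model of that rule for experimenting with values. It is not Fail2ban's actual implementation, and the name `would_ban` is hypothetical.

```bash
# Simplified model: return 0 if any window of `findtime` seconds
# contains at least `maxretry` failures.
# Arguments: maxretry findtime, then failure timestamps (epoch seconds, ascending).
would_ban() {
  local maxretry="$1" findtime="$2"
  shift 2
  printf '%s\n' "$@" | awk -v n="$maxretry" -v w="$findtime" '
    { t[NR] = $1 }
    END {
      # Compare each failure with the one (n - 1) entries before it.
      for (i = n; i <= NR; i++)
        if (t[i] - t[i - n + 1] <= w) { found = 1; break }
      exit (found ? 0 : 1)
    }'
}
```

With the settings above, `would_ban 5 600 0 100 200 300 400` succeeds (five failures in 400 seconds), while `would_ban 5 600 0 200 400 600 800` does not. On a running stack, `docker exec fail2ban fail2ban-client status sshd` shows the addresses that are actually banned.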
---
#### 7. Set Up Firewall (UFW)
```bash
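# Default deny
ufw default deny incoming
ufw default allow outgoing
# Essential
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP
ufw allow 443/tcp # HTTPS
ufw allow 443/udp # HTTP/3 (QUIC), matches the Caddy port mapping
# Minecraft (Java Edition uses TCP)
ufw allow 25565/tcp
# TeamSpeak
ufw allow 9987/udp
ufw allow 30033/tcp
# Internal only. These rules take effect only if the Grafana and Duplicati
# ports are published on a LAN-reachable address instead of 127.0.0.1.
ufw allow from 192.168.1.0/24 to any port 3000 # Grafana
ufw allow from 192.168.1.0/24 to any port 8200 # Duplicati
ufw enable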
```
---
### Phase 4: IaC Best Practices
#### Project Structure
```
automa/
├── infrastructure/ # New infra services
│ ├── caddy/
│ ├── monitoring/
│ ├── watchtower/
│ ├── duplicati/
│ └── fail2ban/
├── services/ # Moved from repo root
│ ├── minecraft/
│ ├── teamspeak/
│ └── nextcloud/
├── bin/ # Keep existing scripts
├── backups/ # Local backups
├── .env # Global secrets
└── Makefile # Enhanced
```
---
#### Enhanced Makefile
```makefile
# Add to existing Makefile

# Infrastructure commands
.PHONY: infra-up infra-down
infra-up:
	@echo "Starting infrastructure..."
	cd infrastructure/caddy && docker compose up -d
	cd infrastructure/monitoring && docker compose up -d
	cd infrastructure/watchtower && docker compose up -d
	cd infrastructure/duplicati && docker compose up -d
	cd infrastructure/fail2ban && docker compose up -d

infra-down:
	@echo "Stopping infrastructure..."
	cd infrastructure/fail2ban && docker compose down
	cd infrastructure/duplicati && docker compose down
	cd infrastructure/watchtower && docker compose down
	cd infrastructure/monitoring && docker compose down
	cd infrastructure/caddy && docker compose down

# Full stack
.PHONY: up down
up: infra-up all-up
down: all-down infra-down

# Network setup
.PHONY: network-create
network-create:
	@docker network create automa-proxy || true
	@docker network create automa-monitoring || true
```
---
## Configuration Management
### Environment Variables Strategy
**Structure:**
```
.env # Global (git-ignored)
.env.example # Template (git-tracked)
services/*/.env # Service-specific
infrastructure/*/.env # Infra-specific
```
**Global .env:**
```bash
# Domain
DOMAIN=example.com
# Timezone
TZ=Asia/Shanghai
# Monitoring
GRAFANA_ADMIN_PASSWORD=changeme
# Services
NEXTCLOUD_ADMIN_PASSWORD=changeme
MYSQL_ROOT_PASSWORD=changeme
REDIS_PASSWORD=changeme
```
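Because `.env` is git-ignored, it can drift from `.env.example` or keep placeholder passwords unnoticed. The two helpers below are a minimal sketch for catching both cases; the function names are illustrative and not part of the repo.

```bash
# Print keys that are defined in the template but missing from the real env file.
missing_env_keys() {
  local template="$1" actual="$2" key
  # Take KEY from KEY=value lines; comments and blank lines are skipped.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$template" | cut -d= -f1 |
    while read -r key; do
      grep -qE "^${key}=" "$actual" || echo "$key"
    done
}

# Print keys whose value is still the placeholder "changeme".
placeholder_env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=changeme[[:space:]]*$' "$1" | cut -d= -f1
}
```

For example, `missing_env_keys .env.example .env` followed by `placeholder_env_keys .env` should both print nothing before the stack is exposed to the internet.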
---
### Docker Compose Best Practices
**1. Always set restart policy:**
```yaml
restart: unless-stopped # Not "always"
```
**2. Use healthchecks:**
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s
  timeout: 10s
  retries: 3
```
**3. Set resource limits:**
```yaml
deploy:
  resources:
    limits:
      memory: 512M
    reservations:
      memory: 256M
```
**4. Use named volumes:**
```yaml
volumes:
  - app_data:/data # Named (managed by Docker)
  # NOT: ./data:/data (bind mount)
```
**5. Logging:**
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```
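A healthcheck is most useful when scripts wait on it instead of sleeping for a fixed time. The sketch below polls the health status that Docker records for a container; the function name `wait_healthy` and its default retry values are assumptions made for illustration.

```bash
# Poll a container's health status until it reports "healthy" or we give up.
# The container must define a healthcheck, otherwise no status is available.
wait_healthy() {
  local name="$1" attempts="${2:-30}" delay="${3:-2}" status="" i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    [ "$status" = "healthy" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name is not healthy (last status: ${status:-unknown})" >&2
  return 1
}
```

For instance, `wait_healthy nextcloud 30 2` would wait up to about a minute for the container named `nextcloud` before reporting a failure.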
---
## Deployment Workflow
### Initial Setup
```bash
# 1. Clone repo
git clone https://github.com/yourname/automa.git
cd automa
# 2. Create networks
make network-create
# 3. Copy env files
cp .env.example .env
# Edit .env with your values
# 4. Start infrastructure
make infra-up
# 5. Start services
make all-up
# 6. Check status
make status
docker ps
```
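The setup steps above rely on `git`, `make`, and Docker with the Compose v2 plugin. A pre-flight check along these lines can fail early when one of them is missing; the function names are illustrative.

```bash
# Print each required command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Fail early if anything the setup steps rely on is missing.
preflight() {
  local missing
  missing=$(missing_tools git make docker)
  if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
    return 1
  fi
  # "docker compose" is a CLI plugin, so it needs its own check.
  docker compose version >/dev/null 2>&1 ||
    { echo "Docker Compose v2 plugin not found" >&2; return 1; }
}
```

Running `preflight && make network-create` keeps the network step from failing halfway on a fresh host.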
---
### Update Workflow
**Option 1: Watchtower (automatic)**
- Watches for new images daily
- Pulls and restarts containers
- Only updates labeled containers
**Option 2: Manual**
```bash
# Update single service
cd services/nextcloud
docker compose pull
docker compose up -d
# Update all
make all-down
cd services/minecraft && docker compose pull && cd ../..
cd services/teamspeak && docker compose pull && cd ../..
cd services/nextcloud && docker compose pull && cd ../..
make all-up
```
---
### Backup Workflow
**1. Local backup (existing):**
```bash
make backup # Runs bin/backup.sh
```
**2. Remote backup (Duplicati):**
- Automatic daily at 3 AM
- Or manual via web UI
**3. Restore:**
```bash
# Stop service
cd services/nextcloud
docker compose down
# Restore from backup
cd ../../backups/nextcloud/YYYYMMDD-HHMMSS
tar -xzf nextcloud_data.tar.gz -C /path/to/volume
# Start service
cd ../../services/nextcloud
docker compose up -d
```
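When the data lives in a named volume, `/path/to/volume` is not a convenient host path. One common approach is to unpack the archive through a throwaway container that mounts the volume. The sketch below assumes the archive is a gzip-compressed tarball as produced above; the volume name you pass in and the function names are hypothetical.

```bash
# Check that the archive is readable before touching any data.
verify_archive() {
  tar -tzf "$1" >/dev/null 2>&1
}

# Unpack a backup archive into a named Docker volume via a temporary container.
# Assumption: the volume name matches the one your compose project created.
restore_volume() {
  local archive="$1" volume="$2" dir
  verify_archive "$archive" ||
    { echo "Corrupt or missing archive: $archive" >&2; return 1; }
  dir="$(cd "$(dirname "$archive")" && pwd)"
  docker run --rm \
    -v "${volume}:/target" \
    -v "${dir}:/backup:ro" \
    alpine:3 tar -xzf "/backup/$(basename "$archive")" -C /target
}
```

Stop the service first, then call something like `restore_volume nextcloud_data.tar.gz <volume-name>`; `docker volume ls` shows the exact volume name to use.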
---
## Resource Planning
### Minimum Requirements
**For current 3 services:**
- CPU: 4 cores
- RAM: 8 GB
- Disk: 100 GB
**With full stack (infra + services):**
- CPU: 6 cores
- RAM: 12 GB
- Disk: 200 GB (or 100GB SSD + 500GB HDD)
### Resource Breakdown
| Component | CPU | RAM | Disk |
|-----------|-----|-----|------|
| **Services** | | | |
| Minecraft | 1-2 cores | 2-4 GB | 10-20 GB |
| TeamSpeak | 0.1 cores | 100 MB | 500 MB |
| Nextcloud | 0.5 cores | 500 MB | 20-100 GB |
| MariaDB | 0.2 cores | 500 MB | 5-10 GB |
| Redis | 0.1 cores | 100 MB | 100 MB |
| **Infrastructure** | | | |
| Caddy | 0.1 cores | 50 MB | 50 MB |
| Prometheus | 0.5 cores | 500 MB | 10 GB |
| Grafana | 0.1 cores | 200 MB | 500 MB |
| Loki | 0.2 cores | 300 MB | 5 GB |
| Others | 0.1 cores | 200 MB | 1 GB |
| **Total** | **~3-4 cores** | **~4.5-6.5 GB** | **~50-150 GB** |
---
## Monitoring Setup
### Import Grafana Dashboards
1. Open Grafana: https://grafana.example.com
2. Log in (admin / changeme)
3. Import dashboards:
   - **11074** - Node Exporter (host metrics; requires node_exporter, which the stack above does not include)
   - **193** - Docker monitoring
   - **12486** - Loki logs
   - **13770** - Nextcloud
---
## Security Checklist
- [ ] Change all default passwords
- [ ] Enable UFW firewall
- [ ] Set up Fail2ban
- [ ] Enable HTTPS (Caddy auto)
- [ ] Restrict Grafana/Duplicati to local network
- [ ] Use strong passwords (16+ chars)
- [ ] Enable 2FA for Nextcloud
- [ ] Regular backups (automated)
- [ ] Keep services updated (Watchtower)
- [ ] Review logs weekly
---
## Troubleshooting
### Common Issues
**Container won't start:**
```bash
docker logs <container-name>
```
**Network issues:**
```bash
docker network ls
docker network inspect automa-proxy
```
**Disk full:**
```bash
docker system prune -a # Removes stopped containers, unused networks, and ALL images not used by a container
df -h
```
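Disk usage can also be checked before it becomes a problem, for example from cron. A minimal sketch; the function names and the 85% default threshold are arbitrary choices for illustration.

```bash
# Print the used-space percentage (without the % sign) for the filesystem holding a path.
disk_used_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Warn and return non-zero when usage reaches the threshold.
check_disk() {
  local path="${1:-/}" limit="${2:-85}" used
  used=$(disk_used_percent "$path")
  if [ "$used" -ge "$limit" ]; then
    echo "WARNING: $path is ${used}% full" >&2
    return 1
  fi
}
```

When `check_disk /var/lib/docker` warns, `docker system df` shows how much of the space is taken by images, containers, and volumes before anything is pruned.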
**Service unreachable:**
```bash
curl -I http://localhost:PORT # Test locally
docker ps # Check if running
docker exec -it <container> sh # Debug inside
```
---
## Next Steps
### Optional Enhancements
**1. Alerting:**
- Add Alertmanager to Prometheus
- Send alerts to Telegram/Discord/Email
**2. CI/CD:**
- Add Drone CI for config testing
- Auto-deploy on git push
**3. High Availability:**
- Enable Docker Swarm mode
- Set up a load balancer
**4. Advanced Monitoring:**
- Add Uptime Kuma (status page)
- Add blackbox exporter (external monitoring)
**5. Additional Services:**
- Gitea (self-hosted Git)
- Vaultwarden (password manager)
- Homer (dashboard)
---
## Summary
### What We Added
- **Caddy** - Auto HTTPS + reverse proxy
- **Monitoring** - Prometheus + Grafana + Loki
- **Watchtower** - Auto updates
- **Duplicati** - Remote backups
- **Fail2ban** - Security
- **UFW** - Firewall
### What to Keep
- ✅ Current Docker Compose structure
- ✅ Existing backup scripts
- ✅ Makefile commands
- ✅ MariaDB + Redis
### What Changed
- Pinned image versions where versioned tags exist (some images above still use `:latest`)
- Added infrastructure/ folder
- Enhanced Makefile
- Added monitoring stack
### Benefits
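- **Automation**: an estimated 70% less manual work, since updates, certificates, and remote backups run unattended
- **Security**: Multi-layer defense (firewall, Fail2ban, HTTPS)
- **Visibility**: Metrics and logs in one place (Prometheus, Loki, Grafana)
- **Reliability**: Automatic restarts + backups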