# Automa Architecture
Self-hosted services platform following Unix philosophy: simple, modular, composable.
## Design Principles
1. **KISS** - Keep It Simple, Stupid
2. **Single Responsibility** - Each service does one thing well
3. **Replaceable** - Any component can be swapped
4. **Composable** - Services work together via standard interfaces
5. **Observable** - Everything is monitored and logged
6. **Recoverable** - Regular backups, tested restore procedures
## System Overview
```
┌─────────────────────────────────────────────────────┐
│                      Internet                        │
└──────────────────────────┬───────────────────────────┘
                ┌──────────▼──────────┐
                │   Firewall (UFW)    │
                │      Fail2ban       │
                └──────────┬──────────┘
                ┌──────────▼──────────┐
                │   Caddy (80/443)    │
                │   - Auto HTTPS      │
                │   - Reverse Proxy   │
                └──────────┬──────────┘
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
  │ Nextcloud │      │  Grafana  │      │ Minecraft │
  │ + MariaDB │      │           │      │ (host net)│
  │ + Redis   │      │           │      │           │
  └─────┬─────┘      └─────┬─────┘      └─────┬─────┘
        │            ┌─────▼─────┐            │
        │            │Prometheus │            │
        │            │Loki       │            │
        │            │Promtail   │            │
        │            │cAdvisor   │            │
        │            └─────┬─────┘            │
        │                  │                  │
        └──────────────────┼──────────────────┘
                    ┌──────▼──────┐
                    │ Watchtower  │
                    │ Duplicati   │
                    └──────┬──────┘
                    ┌──────▼──────┐
                    │  Backups    │
                    │  (Local +   │
                    │   Remote)   │
                    └─────────────┘
```
## Component Stack
### Layer 1: Edge (Internet-facing)
| Component | Purpose | Ports | Why |
|-----------|---------|-------|-----|
| **UFW** | Firewall | All | Simple, built-in Linux |
| **Fail2ban** | Intrusion prevention | - | Auto-ban attackers |
| **Caddy** | Reverse proxy + SSL | 80, 443 | Auto HTTPS, simple config |
### Layer 2: Applications
| Service | Purpose | Ports | Stack |
|---------|---------|-------|-------|
| **Nextcloud** | Private cloud | 80→Caddy | PHP + MariaDB + Redis |
| **Minecraft** | Game server | 25565 | Fabric 1.21.1 |
| **TeamSpeak** | Voice chat | 9987/udp, 30033 | TeamSpeak 3 |
### Layer 3: Observability
| Component | Purpose | Storage | Why |
|-----------|---------|---------|-----|
| **Prometheus** | Metrics DB | 10GB/30d | Industry standard |
| **Grafana** | Dashboards | 500MB | Best visualization |
| **Loki** | Log aggregation | 5GB/30d | Lightweight ELK alternative |
| **Promtail** | Log collector | - | Pairs with Loki |
| **cAdvisor** | Container metrics | - | Docker native |
### Layer 4: Automation
| Component | Purpose | Why |
|-----------|---------|-----|
| **Watchtower** | Auto-update images | Label-based, simple |
| **Duplicati** | Remote backups | Web UI, encrypted |
| **bin/backup.sh** | Local backups | Custom, flexible |
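Watchtower's label-based mode can be wired up in compose along these lines (a sketch; service names and the schedule are illustrative, not the repo's actual config):

```yaml
services:
  watchtower:
    image: containrrr/watchtower
    environment:
      - WATCHTOWER_LABEL_ENABLE=true     # only touch explicitly labeled containers
      - WATCHTOWER_SCHEDULE=0 0 4 * * *  # daily at 04:00
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  caddy:
    image: caddy:2-alpine
    labels:
      - com.centurylinklabs.watchtower.enable=true  # opt in to auto-updates
```

Databases simply omit the label, so they are never auto-updated.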
## Network Architecture
### Networks
```
automa-proxy (172.20.0.0/16)
├─ caddy
├─ nextcloud
└─ grafana
automa-monitoring (172.21.0.0/16, internal)
├─ prometheus
├─ loki
├─ promtail
└─ cadvisor
nextcloud (172.22.0.0/16)
├─ nextcloud
├─ nextcloud-db
└─ nextcloud-redis
teamspeak (172.23.0.0/16)
└─ teamspeak
(host network)
└─ minecraft # Needs direct port access for UDP
```
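In compose terms, the networks above might be declared like so (subnets taken from the tree; a sketch, not the actual file):

```yaml
networks:
  automa-proxy:
    ipam:
      config:
        - subnet: 172.20.0.0/16
  automa-monitoring:
    internal: true          # no route to the internet
    ipam:
      config:
        - subnet: 172.21.0.0/16
```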
### Port Mapping
**External (public):**
- 80 → Caddy (HTTP → HTTPS redirect)
- 443 → Caddy (HTTPS)
- 25565 → Minecraft
- 9987/udp → TeamSpeak voice
- 30033 → TeamSpeak file transfer
**Internal (localhost only):**
- 3000 → Grafana (proxied via Caddy)
- 8080 → Nextcloud (proxied via Caddy)
- 8200 → Duplicati
- 9090 → Prometheus
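Caddy forwards the public hostnames to those localhost ports with plain `reverse_proxy` site blocks; a sketch with placeholder domains (the real hostnames live in the actual Caddyfile):

```caddyfile
grafana.example.com {
    reverse_proxy localhost:3000
}

cloud.example.com {
    reverse_proxy localhost:8080
}
```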
## Data Flow
### Request Flow
```
User → Internet → Firewall → Caddy → Application
Application → metrics → Prometheus
Grafana → query → Prometheus
```
### Log Flow
```
Container → stdout/stderr → Docker logs → Promtail → Loki → Grafana
```
### Backup Flow
```
Service data → bin/backup.sh → local backup → Duplicati → remote storage
```
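The local-backup step can be sketched as a timestamped tar archive; this is a minimal stand-in, not the actual `bin/backup.sh` (paths and the demo payload are illustrative):

```shell
#!/usr/bin/env bash
# Minimal local backup sketch: archive a data directory into ./backups
# under a timestamped name. DATA_DIR/BACKUP_DIR are illustrative.
set -euo pipefail

DATA_DIR="./data"
BACKUP_DIR="./backups"
STAMP="$(date +%Y%m%d-%H%M%S)"

mkdir -p "$DATA_DIR" "$BACKUP_DIR"
echo "hello" > "$DATA_DIR/example.txt"   # demo payload only

tar -czf "$BACKUP_DIR/backup-$STAMP.tar.gz" -C . "$(basename "$DATA_DIR")"
ls "$BACKUP_DIR"
```

Duplicati then picks up `./backups/` as its source for the remote leg.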
## Storage Strategy
### Volume Types
**Named volumes** (managed by Docker):
- Database data (MariaDB)
- Cache (Redis)
- Monitoring data (Prometheus, Loki, Grafana)
- Config (Caddy, Duplicati)
**Bind mounts** (host filesystem):
- Minecraft world/mods/configs (easy access)
- Backup output directory
- Log files
### Backup Strategy
**3-2-1 Rule:**
- 3 copies of data
- 2 different media
- 1 offsite
**Implementation:**
1. Live data (volumes/bind mounts)
2. Local backup (bin/backup.sh → ./backups/)
3. Remote backup (Duplicati → S3/SFTP/etc)
**Retention:**
- Local: 7 days
- Remote: 30 days
- Configs: forever
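The 7-day local retention can be enforced with a single `find` at the end of the backup run; a self-contained sketch (directory and file names are illustrative, and `touch -d` here only fakes an old archive for the demo):

```shell
#!/usr/bin/env bash
set -euo pipefail

mkdir -p ./backups
touch ./backups/fresh.tar.gz                 # demo: made just now
touch -d "10 days ago" ./backups/stale.tar.gz  # demo: past retention (GNU touch)

# Delete local backups older than 7 days.
find ./backups -name '*.tar.gz' -mtime +7 -delete
ls ./backups
```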
## Update Strategy
### Image Versioning
**Pinning strategy:**
```yaml
# ✅ Good - pin major version, get patches
image: nextcloud:28-apache
image: mariadb:11.2-jammy
image: grafana/grafana:10-alpine
# ⚠️ Acceptable - semantic versioning not available
image: teamspeak:latest
# ❌ Bad - unpredictable
image: nextcloud:latest
```
### Update Methods
**Automatic (Watchtower):**
- Runs daily
- Only updates labeled containers
- Good for: Caddy, Grafana, Nextcloud app
- Bad for: Databases, critical services
**Manual:**
```bash
docker compose pull
docker compose up -d
```
- Good for: Databases, major version bumps
- Requires: Testing, backup first
## Security Model
### Defense in Depth
**Layer 1: Network**
- UFW firewall (deny all, allow specific)
- Fail2ban (auto-ban attackers)
**Layer 2: TLS**
- Caddy auto-HTTPS
- Force HTTPS redirect
- HSTS headers
**Layer 3: Application**
- Strong passwords (16+ chars)
- 2FA where available (Nextcloud)
- Limited port exposure
**Layer 4: Data**
- Encrypted backups (Duplicati)
- Secrets in .env (not in Git)
- Read-only mounts where possible
### Secrets Management
**Current:**
```
.env (git-ignored)
└─ environment variables
└─ injected into containers
```
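The injection step maps onto compose's `env_file`/`environment` keys; a sketch with hypothetical variable names:

```yaml
services:
  nextcloud-db:
    image: mariadb:11.2-jammy
    env_file: .env                              # git-ignored, holds all secrets
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}  # expanded from .env at startup
```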
**Future option:**
- Docker secrets (Swarm mode)
- SOPS/Age encryption for .env
## Resource Planning
### Minimum Requirements
| Resource | Minimum | Recommended |
|----------|---------|-------------|
| CPU | 4 cores | 6-8 cores |
| RAM | 8 GB | 16 GB |
| Disk | 100 GB | 500 GB SSD |
| Network | 10 Mbps | 100 Mbps |
### Resource Allocation
**Heavy services (reserve resources):**
- Minecraft: 2-4 GB RAM
- MariaDB: 500 MB RAM
- Prometheus: 500 MB RAM
**Light services (minimal):**
- Caddy: 50 MB RAM
- Redis: 100 MB RAM
- Watchtower: 30 MB RAM
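These figures can be enforced rather than just planned, using compose memory limits (a sketch; values mirror the table above, not a tested tuning):

```yaml
services:
  minecraft:
    mem_limit: 4g     # heavy: cap at the reserved maximum
  nextcloud-db:
    mem_limit: 512m
  caddy:
    mem_limit: 64m    # light: small headroom over typical usage
```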
### Scaling Strategy
**Vertical (single server):**
- Add RAM → increase Minecraft players
- Add CPU → faster builds/queries
- Add disk → longer retention
**Horizontal (multiple servers):**
- Separate services by server
- Example: Minecraft on server 1, Nextcloud on server 2
- Use remote monitoring (Prometheus federation)
## High Availability (Future)
**Current state: Single server**
- No HA (single point of failure)
- Acceptable for home lab
**HA options:**
- Docker Swarm (orchestration)
- Load balancer (HAProxy/Caddy)
- Shared storage (NFS/GlusterFS)
- Database replication (MariaDB primary/replica)
**Cost/benefit:**
- Adds significant complexity
- Not recommended for <10 users
## Disaster Recovery
### Scenarios
**1. Service crash**
- Auto-restart: `restart: unless-stopped`
- Health checks: detect and restart
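Together, the restart policy and a health check could look like this in compose (the probe command is illustrative; the container must ship `curl` for it to work):

```yaml
services:
  nextcloud:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/status.php"]
      interval: 30s
      timeout: 5s
      retries: 3
```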
**2. Data corruption**
- Restore from local backup (minutes)
- Last resort: remote backup (hours)
**3. Server failure**
- Restore to new server
- Restore backups
- Update DNS
### Recovery Time Objective (RTO)
| Scenario | Target | Method |
|----------|--------|--------|
| Container restart | <1 min | Docker auto-restart |
| Service failure | <5 min | Manual restart |
| Data corruption | <30 min | Local backup restore |
| Server failure | <4 hours | New server + backup restore |
### Recovery Point Objective (RPO)
| Service | Data Loss | Backup Frequency |
|---------|-----------|------------------|
| Nextcloud | <24 hours | Daily |
| Minecraft | <6 hours | Every 6 hours |
| Configs | <7 days | Weekly |
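The RPO targets above translate into a crontab like the following (script path and per-target arguments are hypothetical, assuming `bin/backup.sh` accepts a target name):

```cron
# Nextcloud: daily at 03:00
0 3 * * *   /opt/automa/bin/backup.sh nextcloud
# Minecraft world: every 6 hours
0 */6 * * * /opt/automa/bin/backup.sh minecraft
# Configs: weekly, Sunday 04:00
0 4 * * 0   /opt/automa/bin/backup.sh configs
```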
## Monitoring & Alerting
### Key Metrics
**Infrastructure:**
- CPU usage (alert >80%)
- Memory usage (alert >85%)
- Disk space (alert >80%)
- Network throughput
**Services:**
- Container status (alert if down >5min)
- Response time (alert >2s)
- Error rate (alert >5%)
**Business:**
- Minecraft: player count, TPS
- Nextcloud: active users, storage
- Backup: last success timestamp
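As a sketch, the disk-space threshold could become a Prometheus alert rule like this (assumes node_exporter metrics, which would mean adding node_exporter to the monitoring stack):

```yaml
groups:
  - name: infra
    rules:
      - alert: DiskSpaceHigh
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem over 80% full"
```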
### Alert Channels
**Current: Grafana alerts**
- Email
- Webhook
**Future options:**
- Telegram bot
- Discord webhook
- PagerDuty
## Technology Choices
### Why These Tools?
| Component | Alternatives | Why Chosen |
|-----------|-------------|------------|
| **Caddy** | Nginx, Traefik | Auto HTTPS, simplest config |
| **Prometheus** | InfluxDB, VictoriaMetrics | Industry standard, huge ecosystem |
| **Grafana** | Kibana, Chronograf | Best dashboards, most plugins |
| **Loki** | ELK, Graylog | 10x lighter than ELK |
| **Watchtower** | Manual, Renovate | Set and forget, label-based |
| **Duplicati** | Restic, Borg | Web UI, widest storage support |
| **MariaDB** | PostgreSQL, MySQL | Drop-in MySQL replacement, faster |
| **Redis** | Memcached, KeyDB | Persistence, richer data types |
### What We Avoided
| Tool | Why Not |
|------|---------|
| **Kubernetes** | Overkill for <10 services, steep learning curve |
| **Traefik** | Over-engineered for simple reverse proxy |
| **ELK Stack** | Too heavy (Elasticsearch needs 2-4GB RAM) |
| **Zabbix** | Old-school, complex setup |
| **Ansible** | Not needed for single-server Docker Compose |
## Future Enhancements
### Phase 1 (Done)
- Reverse proxy (Caddy)
- Monitoring (Prometheus + Grafana)
- Logging (Loki)
- Auto-update (Watchtower)
- Remote backup (Duplicati)
- Security (Fail2ban)
### Phase 2 (Optional)
- [ ] Alertmanager (notifications)
- [ ] Uptime Kuma (status page)
- [ ] Gitea (self-hosted Git)
- [ ] Vaultwarden (password manager)
- [ ] Homer (dashboard)
### Phase 3 (Advanced)
- [ ] Docker Swarm (HA)
- [ ] CI/CD (Drone)
- [ ] Secret management (Vault)
- [ ] Service mesh (if needed)
## Development Workflow
### Local Testing
```bash
# Test config syntax
docker compose -f compose.yml config
# Start in foreground
docker compose up
# Check logs
docker compose logs -f
```
### Deployment
```bash
# Update code
git pull
# Restart services
make down
make up
# Verify
make status
make health
```
### Rollback
```bash
# Git rollback
git log
git checkout <previous-commit>
# Or: Restore from backup
```
## Documentation
- `README.md` - Project overview
- `QUICKSTART.md` - 5-minute setup
- `docs/ARCHITECTURE.md` - This file
- `docs/IMPLEMENTATION.md` - Step-by-step guide
- `infrastructure/README.md` - Infrastructure details
- `docs/architecture-recommendations.md` - Detailed component analysis
## References
- [Docker Compose Best Practices](https://docs.docker.com/compose/production/)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [Caddy Documentation](https://caddyserver.com/docs/)
- [The Twelve-Factor App](https://12factor.net/)