diff --git a/DESIGN.md b/DESIGN.md
index 8f6e8d0..da8905e 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -20,6 +20,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Execute system commands (with safety restrictions)
 - Monitor and repair Nix store corruption
 - Hardware awareness (CPU, GPU, network, storage)
+- USB device detection and monitoring
 
 ### 2. Multi-Host Management via SSH
 
@@ -30,20 +31,21 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Always uses explicit SSH key path: `-i /var/lib/macha/.ssh/id_ed25519`
 - All SSH commands automatically include the `-i` flag with absolute key path
 - Remote commands always prefixed with `sudo`
-- Runs as `macha` user (UID 2501)
+- Runs as `macha` user (UID 2501) for standard operations
+- **Note**: Some internal operations (like remote monitoring) may use `root` SSH for privileged access
 - **DO NOT DUPLICATE these patterns elsewhere** - import from `command_patterns.py`
 - Has `NOPASSWD` sudo access for administrative commands
 - Shares SSH keys with other hosts in the infrastructure
-- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake
+- Can SSH to any hosts defined in your NixOS flake configuration
 
 #### SSH Usage Patterns
 1. **Direct diagnostic commands:**
    ```bash
-   ssh rhiannon systemctl status ollama
-   ssh alexander df -h
+   ssh hostname systemctl status service-name
+   ssh hostname df -h
    ```
    - Commands automatically transformed by the tools layer
-   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama`
+   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name`
    - SSH key path is always explicit, commands are automatically prefixed with `sudo`
 
 2. **Status checks:**
@@ -59,11 +61,34 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 #### When to use SSH vs nh
 - **SSH**: For diagnostics, status checks, log review, quick commands
 - **nh remote deployment**: For applying NixOS configuration changes
-  - `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
+  - `nh os switch -u --target-host=hostname --hostname=hostname`
   - Builds locally, deploys to remote host
   - Use for permanent configuration changes
 
-### 3. NixOS Configuration Management
+### 3. Automatic System Discovery
+
+#### Discovery from Journal Logs
+- Monitors `systemd-journal-remote` logs for new systems sending logs
+- Parses `_HOSTNAME` field from remote journal entries
+- Automatically converts short hostnames to FQDNs (`.coven.systems`)
+- Discovers systems within configurable time windows (default: 10 minutes)
+
+#### OS Detection
+- Detects operating system via SSH (`/etc/os-release`, `uname`)
+- Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
+- Falls back to generic "linux" if specific distro cannot be determined
+- Records OS type in system registry for appropriate management strategies
+
+#### System Profiling
+- Automatically gathers system information upon discovery:
+  - Running services (via `systemctl list-units`)
+  - Hardware info (CPU cores, memory)
+  - Capabilities (containers, web-server, database, remote-access)
+  - System role determination (ai-workstation, server, workstation, minimal)
+- Registers discovered systems in ChromaDB context database
+- Sends notification when new systems are discovered
+
+### 4. NixOS Configuration Management
 
 #### Local Changes
 - Can propose changes to NixOS configuration
@@ -77,7 +102,28 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Can take up to 1 hour for complex builds
 - **IMPORTANT**: Be patient with long-running builds, don't retry prematurely
 
-### 4. Hardware Awareness
+### 5. Configuration File Intelligence (RAG)
+
+#### Semantic Configuration Search
+- NixOS configuration files indexed in ChromaDB for semantic search
+- Query configurations by natural language: "gotify configuration", "journald settings"
+- CLI tools: `macha-configs ` and `macha-configs-read `
+- Relevance scoring helps find the right config files quickly
+- Supports filtering by system hostname or category
+
+#### Configuration Categories
+- `apps/` - Application configurations (Gotify, Nextcloud, etc.)
+- `systems/` - Per-host system configurations
+- `osconfigs/` - Operating system level settings
+- `users/` - User management configurations
+
+#### Git Context Analysis
+- Tracks recent changes to configuration files via `git_context.py`
+- Correlates config changes with system behavior
+- Provides context when investigating issues after deployments
+- Helps understand "what changed" when debugging
+
+### 6. Hardware Awareness
 
 #### Local Hardware Detection
 - CPU: `lscpu` via `nix-shell -p util-linux`
@@ -92,7 +138,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Fallback: `sensors` for temperature data
 - Queries: temperature, utilization, clock speeds, power usage
 
-### 5. Ollama Queue System
+### 7. Ollama Queue System
 
 #### Architecture
 - **File-based queue**: `/var/lib/macha/queues/ollama/`
@@ -117,20 +163,32 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Generate chunk summaries → meta-summary
 - Full outputs cached in `/var/lib/macha/tool_cache/`
 
-### 6. Knowledge Base & Learning
+### 8. Knowledge Base & Learning
+
+#### ChromaDB Architecture
+- **Service**: ChromaDB runs as a standalone service on port 8000
+- **Storage**: Data persisted at `/var/lib/chromadb`
+- **Frontend**: `context_db.py` provides structured Python interface to ChromaDB
+- **Connection**: HTTP client to `localhost:8000`
 
 #### ChromaDB Collections
-1. **System Context**: Infrastructure topology, service relationships
-2. **Issues**: Historical problems and resolutions
-3. **Knowledge**: Operational wisdom learned from experience
+1. **systems**: Infrastructure topology, registered hosts, OS types
+2. **relationships**: System dependencies and relationships
+3. **issues**: Historical problems and resolutions
+4. **decisions**: AI decisions and outcomes
+5. **config_files**: NixOS configuration files for RAG
+6. **knowledge**: Operational wisdom learned from experience
 
-#### Automatic Learning
-- After successful operations, Macha reflects and extracts key learnings
-- Stores: topic, knowledge content, category
+#### Automatic Learning & Reflection
+- After successful operations, Macha automatically reflects via `reflect_and_learn()`
+- Extracts 1-2 specific, actionable learnings from each successful operation
+- Stores: topic, knowledge content, category, confidence level
+- Categories: command, pattern, troubleshooting, performance, general
 - Retrieved automatically when relevant to current tasks
-- Use `macha-knowledge` CLI to view/manage
+- Use `macha-knowledge` CLI to view/manage/search
+- Use `seed_knowledge.py` to populate initial operational knowledge
 
-### 7. Notifications
+### 9. Notifications
 
 #### Gotify Integration
 - Can send notifications via `macha-notify` command
@@ -146,9 +204,10 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Successful completion of major operations
 - Configuration changes that may affect users
 - Security-related events
+- New system discoveries
 - When explicitly requested by user
 
-### 8. Safety & Constraints
+### 10. Safety & Constraints
 
 #### Command Restrictions
 **Allowed Commands** (see `tools.py` for full list):
@@ -180,7 +239,14 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Disabling any service
 - Changes to multiple hosts
 
-### 9. Nix Store Maintenance
+#### Interactive Discussion
+- `macha-approve discuss ` enables interactive Q&A about proposed actions
+- Implemented via `conversation.py` module
+- Users can ask follow-up questions before approving/rejecting
+- Provides detailed explanations and reasoning
+- Commands: `approve`, `reject`, `exit` to control flow
+
+### 11. Nix Store Maintenance
 
 #### Verification & Repair
 - Command: `nix-store --verify --check-contents --repair`
@@ -194,7 +260,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - Can be triggered manually with approval
 - Frees disk space by removing unused derivations
 
-### 10. Conversational Behavior
+### 12. Conversational Behavior
 
 #### Distinguish Requests from Acknowledgments
 - "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
@@ -214,17 +280,16 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 
 ## Infrastructure Topology
 
-### Hosts in Flake
-- **macha**: Main autonomous system (self), GPU server
-- **rhiannon**: Production server
-- **alexander**: Production server
-- **UCAR-Kinston**: Work laptop
-- **test-vm**: Testing environment
+### Managed Hosts
+- **Self**: Main autonomous system running Macha
+- **Configured hosts**: Systems defined in your NixOS flake
+- **Auto-discovered hosts**: Additional systems detected via journal logs
 
 ### Shared Configuration
 - All hosts share root SSH keys (for `nh` remote deployment)
-- `macha` user (UID 2501) exists on all hosts
+- `macha` user (UID 2501) exists on all managed hosts
 - Common NixOS configuration via flake
+- Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others
 
 ## Service Ecosystem
 
@@ -232,14 +297,14 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - `ollama.service`: LLM inference engine
 - `ollama-queue-worker.service`: Request serialization
 - `macha-autonomous.service`: Autonomous monitoring loop
-- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
-- Media: Transmission, SABnzbd, Calibre
+- `chromadb.service`: Vector database for context and knowledge (port 8000)
 
 ### State Directories
 - `/var/lib/macha/`: Main state directory (0755, macha:macha)
 - `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
 - `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
-- `/var/lib/macha/system_context.db`: ChromaDB database
+- `/var/lib/macha/logs/`: Log files and closed issues archive
+- `/var/lib/chromadb/`: ChromaDB vector database storage
 
 ## CLI Tools
 
@@ -247,12 +312,138 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
 - `macha-ask`: Single-question interface
 - `macha-check`: Trigger immediate health check
 - `macha-approve`: Approve pending actions
+  - `macha-approve list` - Show pending actions
+  - `macha-approve discuss ` - Interactive Q&A about action N
+  - `macha-approve approve ` - Approve action N
+  - `macha-approve reject ` - Reject action N
 - `macha-logs`: View autonomous service logs
 - `macha-issues`: Query issue database
 - `macha-knowledge`: Query knowledge base
 - `macha-systems`: List managed systems
+- `macha-configs`: Semantic search for configuration files
+- `macha-configs-read`: Read full configuration file content
 - `macha-notify`: Send Gotify notification
 
+## Architecture & Key Modules
+
+### Core Modules
+
+#### `agent.py` - AI Agent
+- Interfaces with Ollama LLM for reasoning
+- Implements tool calling with `tools.py`
+- Manages conversation history and context
+- Automatic learning via `reflect_and_learn()`
+- Supports queue-based and direct API modes
+
+#### `orchestrator.py` - Main Control Loop
+- Continuous monitoring and health checks
+- Coordinates all other components
+- Manages check intervals and autonomy levels
+- Initializes system registry and configuration parsing
+- Handles system discovery and registration
+
+#### `executor.py` - Safe Action Execution
+- Manages approval queue for actions
+- Respects autonomy levels (observe, suggest, auto-safe, auto-full)
+- Executes approved actions with safety checks
+- Logs all actions and outcomes
+
+#### `tools.py` - System Administration Tools
+- Defines all available tools for the AI agent
+- Command allow-list for safe mode
+- Executes system commands, reads files, checks services
+- Implements hardware queries and GPU metrics
+- Integrates with `command_patterns.py` for SSH
+
+#### `command_patterns.py` - SSH Command Patterns
+- **SINGLE SOURCE OF TRUTH** for SSH commands
+- Builds SSH commands with correct key paths and options
+- Handles automatic sudo prefixing for remote commands
+- Provides `build_ssh_command()` and `build_scp_command()`
+
+#### `context_db.py` - ChromaDB Frontend
+- Structured interface to ChromaDB vector database
+- Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
+- Implements semantic search for configurations and knowledge
+- Tracks system relationships and dependencies
+
+#### `monitor.py` - Local System Monitoring
+- Collects system health metrics (CPU, memory, disk)
+- Checks systemd services and recent errors
+- Monitors NixOS generations and Nix store size
+- Generates human-readable summaries
+
+#### `remote_monitor.py` - Remote System Monitoring
+- SSH-based monitoring of remote hosts
+- Collects resources, services, disk, network status
+- Verifies connectivity before operations
+- Uses `command_patterns.py` for SSH access
+
+#### `system_discovery.py` - Auto-Discovery
+- Discovers new systems from journal logs
+- Detects OS types via SSH probing
+- Profiles systems: services, hardware, capabilities
+- Determines system roles automatically
+
+#### `issue_tracker.py` - Issue Management
+- Creates, updates, resolves, and closes issues
+- Finds similar past issues
+- Auto-resolves issues when problems disappear
+- Archives closed issues to JSONL logs
+
+#### `notifier.py` - Gotify Integration
+- Sends notifications at appropriate priority levels
+- Special methods for common events (failures, discoveries, actions)
+- Fails gracefully if Gotify unavailable
+
+#### `ollama_queue.py` - Request Serialization
+- File-based queue at `/var/lib/macha/queues/ollama/`
+- Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
+- Prevents resource contention on LLM
+- Tracks request status: pending → processing → completed/failed
+
+#### `ollama_worker.py` - Queue Worker Daemon
+- Processes queue requests serially
+- Runs as systemd service `ollama-queue-worker.service`
+- Handles timeouts and failures gracefully
+
+#### `conversation.py` - Interactive Discussion
+- Implements `macha-approve discuss` feature
+- Enables Q&A about proposed actions
+- Maintains context during discussion
+- Helps users understand AI reasoning
+
+#### `config_parser.py` - Configuration File Parsing
+- Parses NixOS configuration files from git repository
+- Indexes configurations in ChromaDB for RAG
+- Categorizes by type: apps, systems, osconfigs, users
+
+#### `git_context.py` - Git Analysis
+- Tracks recent configuration changes
+- Provides context when debugging after deployments
+- Correlates config changes with system issues
+
+#### `journal_monitor.py` - Journal Log Monitoring
+- Monitors systemd journal for specific patterns
+- Triggers on error conditions
+- Feeds into auto-discovery system
+
+#### `seed_knowledge.py` - Knowledge Seeding
+- Populates initial operational knowledge
+- Loads foundational patterns and commands
+- Run via `macha-knowledge seed`
+
+#### `chat.py` - Interactive Chat Interface
+- Implements `macha-chat` and `macha-ask` commands
+- Manages conversation state
+- Integrates with queue system for LLM requests
+
+#### `module.nix` - NixOS Module
+- Defines all configuration options
+- Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
+- Sets up users, permissions, state directories
+- Provides all CLI tool wrappers
+
 ## Philosophy & Principles
 
 1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
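
Review note on the SSH hunk: the diff says `command_patterns.py` provides `build_ssh_command()` and shows the fully expanded command, but not the function itself. The following is a minimal sketch of what such a builder might look like, reconstructed only from the "Full command" line in the diff. The signature, the `use_sudo` parameter, and the constant names are assumptions for illustration, not the module's actual API.

```python
# Hypothetical sketch of an SSH command builder; the real
# build_ssh_command() in command_patterns.py is not shown in the diff.

# Values copied from the DESIGN.md text in the diff.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
SSH_USER = "macha"


def build_ssh_command(host: str, remote_command: str, use_sudo: bool = True) -> list[str]:
    """Return an argv list for running remote_command on host over SSH.

    The explicit -i key path and the sudo prefix mirror the rules stated
    in the design doc; the parameters are assumed names.
    """
    # Prefix with sudo unless the caller already did, to avoid "sudo sudo ...".
    if use_sudo and not remote_command.startswith("sudo "):
        remote_command = f"sudo {remote_command}"
    return [
        "ssh",
        "-i", SSH_KEY_PATH,
        "-o", "StrictHostKeyChecking=no",
        f"{SSH_USER}@{host}",
        remote_command,
    ]


if __name__ == "__main__":
    argv = build_ssh_command("hostname", "systemctl status service-name")
    # Joined with spaces, this reproduces the "Full command" line from the diff.
    print(" ".join(argv))
```

Returning an argv list rather than a single string lets a caller hand the result to `subprocess.run` without invoking a local shell, which fits the "single source of truth" rule the diff states for SSH command construction.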