macha-autonomous/DESIGN.md
Lily Miller 2f367f7cdc Refactor: Centralize command patterns in single source of truth
CRITICAL: Prevents inconsistent sudo/SSH patterns across codebase.

Created command_patterns.py with:
- Single source of truth for ALL command execution patterns
- SSH key path constant: /var/lib/macha/.ssh/id_ed25519
- Remote user constant: macha
- sudo prefix for all remote commands
- Helper functions: build_ssh_command(), transform_ssh_command()
- Self-validation tests

Updated files to use centralized patterns:
- tools.py: Uses transform_ssh_command()
- remote_monitor.py: Uses build_ssh_command()
- system_discovery.py: Uses build_ssh_command()
- DESIGN.md: Documents centralized approach

Benefits:
- Impossible to have inconsistent patterns
- Single place to update if needed
- Self-documenting with validation tests
- Prevents future refactoring errors

DO NOT duplicate these patterns in other files - always import.
2025-10-06 16:06:31 -06:00

Macha Autonomous System - Design Document

⚠️ IMPORTANT - READ THIS FIRST
FOR AI ASSISTANT: This document is YOUR reference guide when modifying Macha's code.

  • ALWAYS consult this BEFORE refactoring to ensure you don't remove existing capabilities
  • CHECK this when adding features to avoid conflicts
  • UPDATE this document when new capabilities are added
  • DO NOT DELETE ANYTHING FROM THIS DOCUMENT
  • During major refactors, you MUST verify each capability listed here is preserved

Overview

Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

Core Capabilities

1. Local System Management

  • Monitor system health (CPU, memory, disk, services)
  • Read and analyze logs via journalctl
  • Check service status and restart failed services
  • Execute system commands (with safety restrictions)
  • Monitor and repair Nix store corruption
  • Hardware awareness (CPU, GPU, network, storage)

2. Multi-Host Management via SSH

Macha CAN and SHOULD use SSH to manage other hosts.

SSH Access

  • CRITICAL: All command patterns are defined in command_patterns.py (SINGLE SOURCE OF TRUTH)
  • DO NOT DUPLICATE these patterns elsewhere - import them from command_patterns.py
  • Every SSH command automatically includes the -i flag with an explicit, absolute key path: -i /var/lib/macha/.ssh/id_ed25519
  • Remote commands are always prefixed with sudo
  • Macha connects as the macha user (UID 2501)
  • The macha user has NOPASSWD sudo access for administrative commands
  • Macha shares SSH keys with other hosts in the infrastructure
  • Macha can SSH to: rhiannon, alexander, UCAR-Kinston, and others in the flake

SSH Usage Patterns

  1. Direct diagnostic commands:

    ssh rhiannon systemctl status ollama
    ssh alexander df -h
    
    • The tools layer automatically transforms these short commands into their full form
    • Full command: ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama
    • The SSH key path is always explicit, and commands are automatically prefixed with sudo
  2. Status checks:

    • Check service health on remote hosts
    • Gather system metrics
    • Review logs
    • Monitor resource usage
  3. File operations:

    • Use scp to copy files between hosts
    • Read configuration files on remote systems
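
The command transformation described in item 1 above could be sketched as follows. This is a minimal illustration, not the actual contents of command_patterns.py: the function names match the helpers named in this document, but their signatures and bodies here are assumptions. The key path, remote user, and StrictHostKeyChecking option are taken from the text above.

```python
import shlex

# Values taken from the SSH Access list and the "Full command" example above.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
REMOTE_USER = "macha"
SSH_OPTIONS = ["-o", "StrictHostKeyChecking=no"]


def build_ssh_command(host: str, remote_args: list[str]) -> list[str]:
    """Return the full argv for running remote_args on host via sudo.

    Hypothetical signature; the real helper may differ.
    """
    if remote_args and remote_args[0] == "sudo":
        remote_args = remote_args[1:]  # avoid a doubled "sudo sudo" prefix
    return ["ssh", "-i", SSH_KEY_PATH, *SSH_OPTIONS,
            f"{REMOTE_USER}@{host}", "sudo", *remote_args]


def transform_ssh_command(command: str) -> str:
    """Expand a short form such as 'ssh rhiannon df -h' into the full form."""
    parts = shlex.split(command)
    # Only rewrite the short form "ssh <host> <command...>"; anything else,
    # including commands that already carry ssh options, is returned unchanged.
    if len(parts) < 3 or parts[0] != "ssh" or parts[1].startswith("-"):
        return command
    host = parts[1].rsplit("@", 1)[-1]  # drop any user@ prefix; the remote user is fixed
    return shlex.join(build_ssh_command(host, parts[2:]))
```

With this sketch, transform_ssh_command("ssh rhiannon systemctl status ollama") yields the full command shown in item 1, while commands that are not short-form SSH invocations pass through untouched.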

When to use SSH vs nh

  • SSH: For diagnostics, status checks, log review, quick commands
  • nh remote deployment: For applying NixOS configuration changes
    • nh os switch -u --target-host=rhiannon --hostname=rhiannon
    • Builds locally, deploys to remote host
    • Use for permanent configuration changes

3. NixOS Configuration Management

Local Changes

  • Can propose changes to NixOS configuration
  • Requires human approval before applying
  • Uses nh os switch for local updates

Remote Deployment

  • Can deploy to other hosts using nh with --target-host
  • Builds configuration locally (on Macha)
  • Pushes to remote system
  • Can take up to 1 hour for complex builds
  • IMPORTANT: Be patient with long-running builds, don't retry prematurely
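
A remote deployment that respects the long build times above might look like the sketch below. The nh flags are copied from the example earlier in this document; the function names, the injectable runner parameter, and the choice of a one-hour timeout with no automatic retry are illustrative assumptions.

```python
import subprocess

# "Can take up to 1 hour for complex builds" - allow the full hour before giving up.
BUILD_TIMEOUT_SECONDS = 60 * 60


def build_nh_deploy_command(host: str) -> list[str]:
    """Return the nh invocation shown in this document for a given host."""
    return ["nh", "os", "switch", "-u", f"--target-host={host}", f"--hostname={host}"]


def deploy(host: str, runner=subprocess.run) -> bool:
    """Run one deployment attempt and report success.

    There is deliberately no retry loop: a slow build is not a failed build.
    `runner` is injectable so the function can be exercised without nh installed.
    """
    command = build_nh_deploy_command(host)
    try:
        result = runner(command, capture_output=True, text=True,
                        timeout=BUILD_TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```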

4. Hardware Awareness

Local Hardware Detection

  • CPU: lscpu via nix-shell -p util-linux
  • GPU: lspci via nix-shell -p pciutils
  • Network: ip addr
  • Storage: df -h, lsblk
  • USB devices: lsusb

GPU Metrics

  • AMD GPUs: Try rocm-smi, sysfs (/sys/class/drm/card*/device/)
  • NVIDIA GPUs: Try nvidia-smi
  • Fallback: sensors for temperature data
  • Queries: temperature, utilization, clock speeds, power usage
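
The vendor-specific tools and the sensors fallback listed above suggest a simple selection order. The sketch below is one way to express it; the function name and the vendor keys are hypothetical, and the AMD sysfs path is left out for brevity.

```python
import shutil
from typing import Callable, Optional

# Preferred tool per GPU vendor, as listed above.
GPU_TOOLS = {"amd": ["rocm-smi"], "nvidia": ["nvidia-smi"]}
FALLBACK_TOOL = "sensors"  # temperature data only


def pick_gpu_tool(vendor: str,
                  which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the first available metrics tool for the vendor, else the fallback.

    `which` is injectable so the lookup can be tested without the tools installed.
    """
    for tool in [*GPU_TOOLS.get(vendor, []), FALLBACK_TOOL]:
        if which(tool):
            return tool
    return None  # nothing usable is on PATH
```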

5. Ollama Queue System

Architecture

  • File-based queue: /var/lib/macha/queues/ollama/
  • Queue worker: ollama-queue-worker.service (runs as macha user)
  • Purpose: Serialize all LLM requests to prevent resource contention

Request Flow

  1. Any user (including regular users) → Write request to pending/
  2. Queue worker → Process requests serially (FIFO with priority)
  3. Queue worker → Write response to completed/
  4. Original requester → Read response from completed/

Priority Levels

  • INTERACTIVE (0): User requests via macha-chat, macha-ask
  • AUTONOMOUS (1): Background maintenance checks
  • BATCH (2): Low-priority bulk operations
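
The request flow and priority levels above can be illustrated with a small file-based queue. This is a sketch under stated assumptions, not the worker's actual code: the JSON field names, the file naming, and the enqueue and next_request helpers are all hypothetical. Only the pending/ directory and the three priority values come from this document.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path
from typing import Optional


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, prompt: str, priority: Priority) -> str:
    """Write a request file into pending/ and return its id."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    request = {"id": request_id, "priority": int(priority),
               "created": time.time_ns(), "prompt": prompt}
    # Write to a temporary name first, then rename, so the worker
    # never picks up a half-written request.
    tmp = pending / f".{request_id}.tmp"
    tmp.write_text(json.dumps(request))
    tmp.rename(pending / f"{request_id}.json")
    return request_id


def next_request(queue_dir: Path) -> Optional[dict]:
    """Return the request the worker should process next, or None if idle."""
    pending = queue_dir / "pending"
    if not pending.is_dir():
        return None
    requests = [json.loads(path.read_text()) for path in pending.glob("*.json")]
    if not requests:
        return None
    # Lowest priority number first; oldest first within the same priority (FIFO).
    return min(requests, key=lambda r: (r["priority"], r["created"]))
```

Sorting on the (priority, created) pair is what gives "FIFO with priority": an INTERACTIVE request always jumps ahead of queued BATCH work, but requests at the same level keep their arrival order.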

Large Output Handling

  • Outputs >8KB: Split into chunks for hierarchical processing
  • Each chunk ~8KB (~2000 tokens)
  • Process chunks serially with progress feedback
  • Generate chunk summaries → meta-summary
  • Full outputs cached in /var/lib/macha/tool_cache/
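
The chunk, summarize, and meta-summarize steps above could be sketched as follows. The function names and the injectable summarize callback (standing in for an LLM call) are assumptions. The sketch also measures size in characters rather than bytes, and returns small outputs unchanged, neither of which this document specifies.

```python
from typing import Callable

CHUNK_SIZE = 8 * 1024  # ~8KB per chunk, as described above


def split_into_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def condense_output(text: str, summarize: Callable[[str], str],
                    chunk_size: int = CHUNK_SIZE) -> str:
    """Return small outputs as-is; summarize large ones hierarchically."""
    if len(text) <= chunk_size:
        return text
    chunks = split_into_chunks(text, chunk_size)
    summaries = []
    for index, chunk in enumerate(chunks, start=1):
        print(f"Processing chunk {index}/{len(chunks)}")  # progress feedback
        summaries.append(summarize(chunk))
    # Meta-summary: one more pass over the joined chunk summaries.
    return summarize("\n".join(summaries))
```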

6. Knowledge Base & Learning

ChromaDB Collections

  1. System Context: Infrastructure topology, service relationships
  2. Issues: Historical problems and resolutions
  3. Knowledge: Operational wisdom learned from experience

Automatic Learning

  • After successful operations, Macha reflects and extracts key learnings
  • Stores: topic, knowledge content, category
  • Retrieved automatically when relevant to current tasks
  • Use macha-knowledge CLI to view/manage

7. Notifications

Gotify Integration

  • Can send notifications via macha-notify command
  • Tool: send_notification(title, message, priority)

Priority Levels

  • 2 (Low/Info): Routine status updates, completed tasks
  • 5 (Medium/Attention): Important events, configuration changes
  • 8 (High/Critical): Service failures, critical errors, security issues
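
The three levels above can be enforced before anything is handed to macha-notify. The sketch below is illustrative only: the enum, the build_notification helper, and the payload field names are assumptions, with the fields mirroring the send_notification(title, message, priority) arguments.

```python
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2      # routine status updates, completed tasks
    MEDIUM = 5   # important events, configuration changes
    HIGH = 8     # service failures, critical errors, security issues


def build_notification(title: str, message: str, priority: int) -> dict:
    """Validate the priority and return a payload for the notification tool."""
    level = NotifyPriority(priority)  # raises ValueError unless priority is 2, 5 or 8
    return {"title": title, "message": message, "priority": int(level)}
```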

When to Notify

  • Critical service failures
  • Successful completion of major operations
  • Configuration changes that may affect users
  • Security-related events
  • When explicitly requested by user

8. Safety & Constraints

Command Restrictions

Allowed Commands (see tools.py for full list):

  • System management: systemctl, journalctl, nh, nixos-rebuild
  • Monitoring: free, df, uptime, ps, top, ip, ss
  • Hardware: lscpu, lspci, lsblk, lshw, dmidecode
  • Remote: ssh, scp
  • Power: reboot, shutdown, poweroff (use cautiously!)
  • File ops: cat, ls, grep
  • Network: ping, dig, nslookup, curl, wget
  • Logging: logger

NOT Allowed:

  • Direct package modifications (nix-env, nix profile)
  • Destructive file operations (rm -rf, dd)
  • User management outside of NixOS config
  • Direct editing of system files (use NixOS config instead)

Critical Services

Never disable or stop:

  • SSH (network access)
  • Networking (connectivity)
  • systemd (system management)
  • Boot-related services

Approval Required

  • Reboots or system power changes
  • Major configuration changes
  • Disabling any service
  • Changes to multiple hosts
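
The allowlist and approval rules in this section could be combined into a single check, sketched below. The authoritative list lives in tools.py; this copy, the classify_command name, and the choice of stop, disable, and mask as the systemctl verbs that need approval are assumptions. The sketch looks only at the top-level command, so it does not inspect commands nested inside ssh, and it does not cover the critical-services rule.

```python
import shlex

# Copied from the "Allowed Commands" list above.
ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
}
POWER_COMMANDS = {"reboot", "shutdown", "poweroff"}
SERVICE_DISABLING_VERBS = {"stop", "disable", "mask"}  # assumed set of verbs


def classify_command(command: str) -> str:
    """Return 'allowed', 'needs_approval' or 'denied' for a command line."""
    try:
        parts = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return "denied"
    # Deny by default: anything not on the allowlist (rm, dd, nix-env, ...) is rejected.
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return "denied"
    if parts[0] in POWER_COMMANDS:
        return "needs_approval"
    if parts[0] == "systemctl" and SERVICE_DISABLING_VERBS & set(parts[1:]):
        return "needs_approval"
    return "allowed"
```

Denying by default means the "NOT Allowed" examples never need their own blocklist: rm -rf, dd, nix-env, and nix profile are rejected simply because their executables are not on the allowlist.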

9. Nix Store Maintenance

Verification & Repair

  • Command: nix-store --verify --check-contents --repair
  • WARNING: Can take anywhere from 30 minutes to several hours
  • Only use when corruption is suspected
  • Not for routine maintenance
  • Verifies all store paths, repairs corrupted files

Garbage Collection

  • Automatic via system configuration
  • Can be triggered manually with approval
  • Frees disk space by removing unused derivations

10. Conversational Behavior

Distinguish Requests from Acknowledgments

  • "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
  • "Can you..." / "Please..." → Request (execute)
  • "What is..." / "How do..." → Question (answer)
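
The three cases above can be illustrated with a simple prefix heuristic. This is only a sketch of the distinction, not how Macha actually decides: the function name and the "unknown" fallback are assumptions, and a message such as "Thanks, can you also check alexander?" would need more than prefix matching.

```python
def classify_message(message: str) -> str:
    """Classify a user message by its opening words."""
    text = message.strip().lower()
    if text.startswith(("thanks", "thank you")):
        return "acknowledgment"  # do not re-execute the previous action
    if text.startswith(("can you", "please")):
        return "request"         # execute
    if text.startswith(("what is", "how do")):
        return "question"        # answer
    return "unknown"             # leave the decision to the model
```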

Tool Calling

  • Don't repeat tool calls unnecessarily
  • If a tool succeeds, don't run it again unless asked
  • Use cached results when available (retrieve_cached_output)

Context Management

  • Be aware of token limits
  • Use hierarchical processing for large outputs
  • Prune conversation history intelligently
  • Cache and summarize when needed

Infrastructure Topology

Hosts in Flake

  • macha: Main autonomous system (self), GPU server
  • rhiannon: Production server
  • alexander: Production server
  • UCAR-Kinston: Work laptop
  • test-vm: Testing environment

Shared Configuration

  • All hosts share root SSH keys (for nh remote deployment)
  • macha user (UID 2501) exists on all hosts
  • Common NixOS configuration via flake

Service Ecosystem

Core Services on Macha

  • ollama.service: LLM inference engine
  • ollama-queue-worker.service: Request serialization
  • macha-autonomous.service: Autonomous monitoring loop
  • Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
  • Media: Transmission, SABnzbd, Calibre

State Directories

  • /var/lib/macha/: Main state directory (0755, macha:macha)
  • /var/lib/macha/queues/: Queue directories (0777 for multi-user)
  • /var/lib/macha/tool_cache/: Cached tool outputs (0777)
  • /var/lib/macha/system_context.db: ChromaDB database

CLI Tools

  • macha-chat: Interactive chat with tool calling
  • macha-ask: Single-question interface
  • macha-check: Trigger immediate health check
  • macha-approve: Approve pending actions
  • macha-logs: View autonomous service logs
  • macha-issues: Query issue database
  • macha-knowledge: Query knowledge base
  • macha-systems: List managed systems
  • macha-notify: Send Gotify notification

Philosophy & Principles

  1. KISS (Keep It Simple, Stupid): Use existing NixOS options, avoid custom wrappers
  2. Verify first: Check source code/documentation before acting
  3. Safety first: Never break critical services, always require approval for risky changes
  4. Learn continuously: Extract and store operational knowledge
  5. Multi-host awareness: Macha manages the entire infrastructure, not just herself
  6. User-friendly: Clear communication, appropriate notifications
  7. Patience: Long-running operations (builds, repairs) can take an hour or more - don't panic
  8. Tool reuse: Use existing, verified tools instead of writing custom scripts

Future Capabilities (Not Yet Implemented)

  • Automatic security updates across all hosts
  • Predictive failure detection
  • Resource optimization recommendations
  • Integration with other communication platforms
  • Multi-agent coordination between hosts
  • Automated testing before deployment