Jean-Gabriel Gill-Couture 29896bfeab
fix(zitadel,operator): user-grant search endpoint + operator keyfile mode
Two bugs surfaced while running the full e2e walk:

1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search
   which Zitadel rejects with 405 Method Not Allowed (the original
   author's note in the comment hinted at this). The cache previously
   masked it: first apply created the grant + cached the id; second
   apply hit the cache and skipped the broken search. The live-query
   refactor (f4d6fb94) removed the cache short-circuit, surfacing
   the bug as "Create user grant failed: User grant already exists"
   on every re-apply.

   Fix: switch to the collection endpoint
   /management/v1/users/grants/_search with a userIdQuery filter,
   matching the Zitadel API that's actually wired up. find_user_grant
   now returns the existing grant on re-apply, so the
   create_user_grant fallback is correctly skipped.

2. Operator keyfile mounted as 0o400 owned by root. The operator pod
   runs as non-root (image USER directive — no fixed runAsUser
   because we want SCC compatibility). Result: operator boots,
   tries to load the JSON keyfile from the Secret volume, hits
   EACCES, fails the credential factory, retries forever.

   Fix: mode 0o444. World-read inside the pod is fine — single
   container, no other consumers, the Secret namespace is locked
   down, and the file never escapes pod-fs. The proper fsGroup-based
   alternative requires pinning a UID/GID, which conflicts with our
   SCC-friendly choice of leaving runAsUser unset.
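
The difference between the two modes can be sketched with a simplified POSIX read check. This is an illustration, not the operator's code: `can_read` and the UID/GID values are hypothetical, and ACLs and capabilities other than the root bypass are ignored.

```rust
/// Simplified POSIX read-permission check for a single file.
/// Root bypasses the mode bits; otherwise exactly one class applies:
/// owner first, then group, then other.
fn can_read(mode: u32, file_uid: u32, file_gid: u32, proc_uid: u32, proc_gids: &[u32]) -> bool {
    if proc_uid == 0 {
        return true; // root ignores permission bits
    }
    if proc_uid == file_uid {
        return mode & 0o400 != 0; // owner read bit
    }
    if proc_gids.contains(&file_gid) {
        return mode & 0o040 != 0; // group read bit
    }
    mode & 0o004 != 0 // other read bit
}

/// Hypothetical scenario from the commit: keyfile owned by root (0:0),
/// process running as an arbitrary non-root UID.
fn keyfile_readable(mode: u32, proc_uid: u32, proc_gids: &[u32]) -> bool {
    can_read(mode, 0, 0, proc_uid, proc_gids)
}
```

With mode 0o400 only the owner bit is set, so any non-root UID is refused regardless of its groups. With 0o444 the group and other bits are set as well, so the read succeeds without pinning a UID or GID.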

Also stages a `git rm` missed in commit 4194baac
(harmony-fleet-auth extraction): the agent's local credentials.rs
was deleted from disk but the deletion was never staged.
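
For the first fix, the request shape can be sketched as below. The endpoint path and the userIdQuery filter name come from this commit; the `queries` wrapper, the inner `userId` field, and every Rust name here are assumptions for illustration, not the actual find_user_grant code.

```rust
/// Collection-level search endpoint named in the fix.
const USER_GRANT_SEARCH_PATH: &str = "/management/v1/users/grants/_search";

/// Minimal escaping for quotes and backslashes only (illustrative; a real
/// client would use a JSON library).
fn escape_json(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds the POST body that filters grants by user ID.
/// Assumed shape: {"queries":[{"userIdQuery":{"userId":"<id>"}}]}
fn user_grant_search_body(user_id: &str) -> String {
    format!(
        r#"{{"queries":[{{"userIdQuery":{{"userId":"{}"}}}}]}}"#,
        escape_json(user_id)
    )
}

/// What to do after the search: reuse an existing grant or create one.
#[derive(Debug, PartialEq)]
enum GrantAction {
    Reuse(String),
    Create,
}

/// Hypothetical decision step: create only when the search found nothing.
fn plan_grant(existing_grant_id: Option<String>) -> GrantAction {
    match existing_grant_id {
        Some(id) => GrantAction::Reuse(id),
        None => GrantAction::Create,
    }
}
```

Because the search now succeeds, a re-apply takes the `Reuse` branch and never reaches the create call that produced "User grant already exists".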

Verified end to end:
  * STACK READY in 2 min on warm cluster
  * Operator pod: "minted fresh Zitadel access token", "NATS connected",
    "starting Deployment controller", "watching device-info KV"
  * 2 Device CRs auto-created with full label set
  * `kubectl apply -f` of a Deployment CR with
    targetSelector.matchLabels: { group: group-a } produced:
      - status.aggregate { matched=1, succeeded=1, failed=0 }
      - HTTP 200 from nginx on vm-device-00:8080
      - connection refused from vm-device-01:8080 (correctly excluded)
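
The selector behaviour checked above can be sketched as an equality match on every selector pair. This is a minimal illustration assuming Kubernetes-style matchLabels semantics; `matches_labels`, `labels`, and the `group-b` value for the excluded device are hypothetical, not taken from the controller.

```rust
use std::collections::BTreeMap;

/// Builds an owned label map from string pairs (helper for the example).
fn labels(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// A device matches when every selector pair is present in its labels
/// with the same value. An empty selector therefore matches everything.
fn matches_labels(
    selector: &BTreeMap<String, String>,
    device_labels: &BTreeMap<String, String>,
) -> bool {
    selector
        .iter()
        .all(|(k, v)| device_labels.get(k) == Some(v))
}
```

Under these assumptions a device labelled `group: group-a` matches the selector and one labelled `group: group-b` does not, which is consistent with matched=1 in the aggregate status.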
2026-05-05 06:55:24 -04:00