SRE / Operator Handbook
This page is for on-call operators and SREs who need a repeatable way to check readiness, inventory the estate, and move through dashboard, alert, and access workflows safely.
Who It Is For
- On-call SREs.
- Platform and Grafana operators.
- Anyone who needs cross-org visibility, export/import checks, or break-glass access.
Primary Goals
- Confirm live readiness before you touch anything.
- Keep a reliable profile for routine checks and repeatable maintenance.
- Choose an auth path that can actually see the scope you need.
Before / After
- Before: SREs had to infer readiness, scope, and replay risk from a chain of ad hoc commands.
- After: SREs use a repeatable profile, move through live checks and staged review, and apply only after preflight.
What success looks like
- You can tell whether the credential really sees the scope you need.
- You can separate live reads from staged review and apply paths.
- You have a reliable operator path for dashboard, alert, and access workflows.
Failure checks
- If the token scope is narrower than the task, stop and fetch a credential that can see the estate you actually need.
- If the live check passes but the apply path fails, verify write permissions and staged inputs before blaming the renderer.
- If you cannot explain which lane the task belongs to, pause and open the workflow chapter first.
Typical Operator Tasks
- Run a live readiness check before a maintenance window.
- Inspect dashboards or datasources across visible orgs.
- Inspect, check, and preview staged changes before an apply path.
- Export or review assets during backup, drift review, or break-glass recovery.
Recommended connection and secret handling
Use a profile backed by admin-capable credentials for day-to-day work.
- --profile with password_env, token_env, or an OS-backed secret store for repeatable operator use.
- Direct Basic auth with --prompt-password for bootstrap or break-glass work.
- Token auth only for narrow reads where you already know the token can see every target org and resource.
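One way to keep a password out of shell history is to read it into an environment variable without echo and let the profile refer to that variable. The following is a minimal sketch, assuming the profile's password_env field names an environment variable; the helper read_secret_into and the variable name GRAFANA_PROD_PASSWORD are hypothetical and not part of grafana-util.

```shell
# Hypothetical helper: read one line from stdin into the named variable and export it.
# On a terminal, echo is switched off so the typed secret is not displayed.
read_secret_into() {
    var_name=$1
    if [ -t 0 ]; then
        printf '%s: ' "$var_name" >&2
        stty -echo
        IFS= read -r secret_value
        stty echo
        printf '\n' >&2
    else
        # Non-interactive input (pipe or file): read the line as-is.
        IFS= read -r secret_value
    fi
    export "$var_name=$secret_value"
    unset secret_value
}
```

A typical use would be to run read_secret_into GRAFANA_PROD_PASSWORD once per session and then call grafana-util with --profile, provided the profile really points at that variable. The sketch does not restore terminal echo if the read is interrupted, so a trap may be worth adding for daily use.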
First commands to run
# Purpose: Check live readiness for the prod profile.
grafana-util status live --profile prod --output-format table
# Purpose: Browse an interactive overview of the live estate.
grafana-util overview live --profile prod --output-format interactive
# Purpose: Inspect the staged changes in the current workspace.
grafana-util change inspect --workspace .
# Purpose: Check the staged changes against live state.
grafana-util change check --workspace . --fetch-live --output-format json
# Purpose: Preview the staged changes against live state before any apply.
grafana-util change preview --workspace . --fetch-live --output-format json
# Purpose: Export dashboards to ./backups, overwriting earlier exports.
grafana-util dashboard export --output-dir ./backups --overwrite --progress
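The read-only part of this sequence can be wrapped so that a failed step stops everything after it. This is a sketch only: it assumes grafana-util exits non-zero when a check fails, which this page does not confirm, and the preflight function and GRAFANA_UTIL override are hypothetical names introduced here.

```shell
# Command name is overridable so the sequence can be rehearsed against a stub.
GRAFANA_UTIL=${GRAFANA_UTIL:-grafana-util}

# Run the read-only checks in order and stop at the first one that fails.
# Assumption: grafana-util returns a non-zero exit status on failure.
preflight() {
    profile=$1
    workspace=$2
    "$GRAFANA_UTIL" status live --profile "$profile" --output-format table || {
        echo "preflight: live status check failed" >&2; return 1; }
    "$GRAFANA_UTIL" change inspect --workspace "$workspace" || {
        echo "preflight: staged inspect failed" >&2; return 1; }
    "$GRAFANA_UTIL" change check --workspace "$workspace" --fetch-live --output-format json || {
        echo "preflight: change check failed" >&2; return 1; }
    "$GRAFANA_UTIL" change preview --workspace "$workspace" --fetch-live --output-format json || {
        echo "preflight: change preview failed" >&2; return 1; }
}
```

Calling preflight prod . mirrors the commands listed above. The export step is left out on purpose because it writes to disk and is not a check.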
If you need to start from the access layer instead, swap the last line for:
# Purpose: List the orgs visible to the current credential.
grafana-util access org list --table
If you are checking a host directly, Basic auth is the safest fallback for broad visibility:
# Purpose: Check live status across all orgs with Basic auth and a prompted password.
grafana-util status live --url http://localhost:3000 --basic-user admin --prompt-password --all-orgs --output-format table
Use token auth only when the scope matches the work:
# Purpose: Read the live overview with a token whose scope is known to cover the targets.
grafana-util overview live --url http://localhost:3000 --token "$GRAFANA_API_TOKEN" --output-format json
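Before a token-based call, it can help to fail early when the token variable is unset or empty, since an empty --token value is easy to miss. The guard below is a small sketch; require_env is a hypothetical helper, and it only checks that the variable is present, not that the token's scope matches the work.

```shell
# Hypothetical guard: fail early when a required environment variable is unset or empty.
# The argument must be a literal variable name, because it is expanded through eval.
require_env() {
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "missing required environment variable: $1" >&2
        unset value
        return 1
    fi
    unset value
}
```

For example, require_env GRAFANA_API_TOKEN && grafana-util overview live --url http://localhost:3000 --token "$GRAFANA_API_TOKEN" --output-format json runs the read only when the variable holds a value.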
What good operator posture looks like
You are in a good operator posture when:
- you can tell whether the current credential can really see the org or admin scope you need
- you can separate live reads, staged review, and actual apply paths
- you run preflight or dry-run checks before destructive actions
- you know which command page to open when the surface shifts from status into dashboard, alert, or access work
Read next
- Change & Status
- Dashboard Management
- Data source Management
- Alerting Governance
- Access Management
- Troubleshooting
Common mistakes and limits
- Do not assume a token can see --all-orgs; that is one of the easiest ways to get partial inventory and miss a problem.
- Do not paste --basic-password into shared shell history unless you are deliberately in a throwaway session.
- Do not use --show-secrets outside a local, controlled inspection step.
- Do not treat a successful read-only check as proof that write or admin workflows will also work.
- Do not skip change check, change preview, or command-specific --dry-run paths before high-impact changes.
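Two of these warnings concern specific flags, so they can be caught mechanically before a command is run or saved into a runbook. The function below is an illustrative sketch; check_risky_flags is a hypothetical helper that only inspects the arguments it is given and knows nothing else about grafana-util.

```shell
# Hypothetical lint: flag argument lists that use options this page warns about.
# Returns 0 when nothing risky is found, 1 otherwise.
check_risky_flags() {
    risky=0
    for arg in "$@"; do
        case $arg in
            --basic-password)
                echo "warning: --basic-password lands in shell history; prefer --prompt-password" >&2
                risky=1 ;;
            --show-secrets)
                echo "warning: keep --show-secrets to a local, controlled inspection step" >&2
                risky=1 ;;
        esac
    done
    return "$risky"
}
```

Passing the intended arguments first, as in check_risky_flags status live --profile prod --output-format table, gives a quick yes or no before the real command is run.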
When to switch to deeper docs
- Switch to Dashboard Management when the issue is inventory, export/import, inspection, or screenshot workflow.
- Switch to Alerting Governance when the problem is rule ownership, contact points, routes, or plan/apply flow.
- Switch to Access Management when org, user, team, or service-account scope becomes part of the incident or maintenance task.
- Switch to the Command Docs when you already know the workflow and just need the exact flags.