DevOps Interview — Full Q&A (Detailed, Ready-to-Use): 3 years of experience

If you’re preparing for DevOps interviews, here’s a breakdown to help you answer confidently, think like an engineer, and crack real-world scenarios.

Nov 12, 2025

Below is a thorough set of technical + behavioral Q&A you can use to prepare for DevOps & Cloud interviews. Each question includes a strong, interview-ready answer and short tips — ready to speak aloud or adapt to your own experience.

1 — Intro / Elevator Pitch

Q: Tell me about yourself.
A: “Good morning. I’m a DevOps engineer with ~3 years of experience working on infrastructure automation, CI/CD pipelines, containerization, and Kubernetes-based deployments. I’m proficient with Bash and Python for automation, build Docker images and Kubernetes manifests, and manage infrastructure as code using Terraform.

In my current role I design CI/CD pipelines using GCP Cloud Build, automate infrastructure provisioning with Terraform (remote state in GCS), and deploy and operate microservices on GKE. I also handle incident troubleshooting, collaborate with developers on secret management (Secret Manager), and maintain observability with logging and alerts. I enjoy turning manual operations into repeatable, automated processes that reduce lead time and incidents.”
Tip: Keep it short (30–60s). End with a sentence about what you want next.

2 — Docker

Q: What is the difference between a Docker image and a Docker container?
A: “A Docker image is a read-only template (blueprint) built from a Dockerfile. A container is a runtime instance of an image — a running process with its own filesystem, network, and metadata. You build images (e.g., docker build), push them to a registry, and run containers from those images (docker run).”
Tip: Mention multi-stage builds and cache layers if asked for best practices.

Q: A container keeps restarting — how do you debug?
A: Steps: docker ps -a → check status and exit code; docker logs <id> → inspect logs; docker inspect <id> → view restart policy, env vars, mounts, and State (ExitCode, OOMKilled); check host resources; examine healthchecks and the entrypoint; then reproduce locally with the same env vars.
Tip: Mention reading the exit code (137 means the process was killed, often by the OOM killer) and entrypoint problems.
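
The status column of docker ps -a and the State section of docker inspect both report the container's exit code, and decoding it is often the fastest clue. Here is a minimal sketch, assuming the usual shell convention that codes above 128 mean "128 + signal number". The explain_exit_code helper and its hint strings are made up for illustration and are not part of Docker.

```python
import signal

# Conventional exit codes that are not signal-related.
SPECIAL_CODES = {
    0: "exited cleanly",
    1: "general application error",
    126: "command found but not executable (check entrypoint permissions)",
    127: "command not found (check the entrypoint/CMD path)",
}


def explain_exit_code(code: int) -> str:
    """Return a human-readable hint for a container exit code."""
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"terminated by unknown signal {signum}"
        hint = " (often the OOM killer; check State.OOMKilled)" if name == "SIGKILL" else ""
        return f"terminated by {name}{hint}"
    return "application-specific error code"


for code in (0, 127, 137, 143):
    print(code, "->", explain_exit_code(code))
```

In an interview, walking through "137 is 128 + 9, so SIGKILL, so I'd check memory limits" shows you can reason from evidence rather than guess.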

The Managerial Round

The interviewer shared valuable insights about how Rohit could improve his responses for behavioral or managerial interviews:

Avoid saying “I don’t know.” Instead, show how you would find the answer or troubleshoot.
Speak slowly and clearly. Confidence sometimes shows in pauses, not speed.
Avoid saying “I get tickets assigned.” Instead, say “I proactively pick tasks from the backlog.” It shows initiative.
Use the STAR format (Situation, Task, Action, Result) to structure behavioral answers.
Highlight impact, not activity. Instead of “I fixed a build issue,” say “I resolved a production outage impacting 10,000 users within 30 minutes.”

3 — Kubernetes

Q: How does Kubernetes know a pod is unhealthy?
A: Kubernetes uses liveness and readiness probes defined in the pod spec (plus optional startup probes for slow-starting containers). The kubelet runs these probes (HTTP, TCP, gRPC, or exec command) against each container. If the liveness probe fails failureThreshold times in a row, the kubelet restarts the container. If the readiness probe fails, the pod is removed from Service endpoints so traffic stops going to it. Additionally, controllers (Deployment/ReplicaSet) compare actual vs desired state and take corrective actions.
Tip: Clarify the difference: liveness = restart the container; readiness = remove the pod from Service endpoints.
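
To make "fails repeatedly" concrete, here is a simplified model of how consecutive liveness failures lead to a restart. It assumes a failure threshold of 3 (the Kubernetes default for failureThreshold) and ignores timing settings such as initialDelaySeconds and periodSeconds. The restart_points function is a hypothetical helper, not kubelet code.

```python
from typing import Iterable, List


def restart_points(results: Iterable[bool], failure_threshold: int = 3) -> List[int]:
    """Return the probe indices at which a restart would be triggered.

    results: one boolean per probe run (True = probe succeeded).
    A restart fires once `failure_threshold` consecutive probes have failed;
    any success resets the counter, and so does the restart itself.
    """
    restarts = []
    consecutive_failures = 0
    for index, ok in enumerate(results):
        if ok:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts.append(index)
            consecutive_failures = 0  # the new container starts with a clean slate
    return restarts


# Two failures, a recovery, then three failures in a row.
history = [True, False, False, True, False, False, False, True]
print(restart_points(history))  # [6]
```

The point to make aloud: a single failed probe does nothing; only an unbroken run of failures triggers the restart.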

Q: What happens when a pod dies — who brings it back?
A: The Kubernetes control plane (API server + controller manager) and the appropriate controller (e.g., Deployment → ReplicaSet) ensure desired replicas are met. If a pod dies, the ReplicaSet notices fewer running replicas than desired and creates a new pod. The kube-scheduler assigns it to a node. This is reconciliation loop behavior.
Tip: Mention node conditions and eviction if nodes are unschedulable.
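
The reconciliation loop is easier to explain with a toy model. The sketch below assumes a controller that only compares a desired replica count with a list of running pod names; real controllers watch the API server and work with pod objects. The pod names and the reconcile function are invented for illustration.

```python
from typing import List, Tuple


def reconcile(desired: int, running: List[str], next_id: int) -> Tuple[List[str], int]:
    """Run one reconciliation pass.

    Returns the new list of running pod names and the next unused numeric id.
    """
    pods = list(running)
    while len(pods) < desired:   # too few pods: create replacements
        pods.append(f"web-{next_id}")
        next_id += 1
    while len(pods) > desired:   # too many pods: remove the surplus
        pods.pop()
    return pods, next_id


pods, next_id = reconcile(3, [], next_id=1)   # initial rollout
pods.remove("web-2")                          # simulate a pod dying
pods, next_id = reconcile(3, pods, next_id)   # controller notices and heals
print(pods)  # ['web-1', 'web-3', 'web-4']
```

Note that the dead pod is not revived: a new pod with a new identity replaces it, which is exactly why stateless workloads suit Deployments.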

Q: Pod vs Node — difference?
A: Pod = smallest deployable unit (one or more containers sharing network/mounts). Node = a worker machine (VM/physical) where pods run; runs kubelet, kube-proxy, and container runtime.

Q: Deployment vs StatefulSet — when to use each?
A: Use Deployment for stateless workloads (web apps). Use StatefulSet for stateful workloads needing stable network IDs and persistent volumes (databases). StatefulSet provides stable ordinal identities and stable storage per pod.

Q: How to investigate historical scaling events (e.g., 3 AM spikes)?
A: Start with metrics history: Prometheus metrics and Grafana dashboards queried for the time window. Check cluster autoscaler and HPA events, and review Kubernetes events (kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp); note that the API server keeps events for only about an hour by default, so older ones have to come from an event exporter or centralized logs. Look at control plane logs (if accessible), check cloud provider autoscaling groups and audit logs, and examine centralized logs (ELK / Logs Explorer) for the time window. If metrics retention is low, implement longer retention or periodic snapshot exports.
Tip: Emphasize observability and retention strategy.

4 — Terraform & IaC

Q: What is Terraform state and why is it important?
A: Terraform state is a file that maps resources in your config to real cloud resources. It stores metadata and is required to plan/apply changes safely. Without state, Terraform cannot determine existing resources or compute diffs.

Q: What if the Terraform state file is lost?
A: Best-case: restore from backup or versioned remote backend (e.g., S3 with versioning or a Terraform Cloud workspace). If no backup: you must reconstruct state — import existing resources with terraform import into proper resource blocks, or destroy and recreate cautiously (risky for production). Preventive measures: use remote backend with locking (S3 + DynamoDB / GCS with locking), enable versioning, and automate periodic snapshots.
Answer style for interview: “If lost, I’d first restore from remote backend or backups. If none, I’d audit real resources and terraform import them back into state, verifying using terraform plan step-by-step. Long-term: enable remote backend + versioning + state locking.”
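
When rebuilding state by hand, it helps to track which resources still need a terraform import. The sketch below assumes the JSON layout Terraform currently writes for state (a top-level "resources" list with "mode", "type", "name", and an optional "module"). That layout is an internal format rather than a stable API, so terraform state list remains the supported way to inspect state. The sample document and resource names are hypothetical.

```python
import json
from typing import List, Set


def addresses_in_state(state_json: str) -> Set[str]:
    """Collect managed resource addresses (type.name) from a state document."""
    state = json.loads(state_json)
    found = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # skip data sources; they are never imported
        address = f'{resource["type"]}.{resource["name"]}'
        if "module" in resource:
            address = f'{resource["module"]}.{address}'
        found.add(address)
    return found


def missing_imports(expected: List[str], state_json: str) -> List[str]:
    """Return expected addresses that are absent from state, sorted."""
    return sorted(set(expected) - addresses_in_state(state_json))


# Hypothetical, heavily trimmed state document for illustration only.
SAMPLE_STATE = json.dumps({
    "version": 4,
    "resources": [
        {"mode": "managed", "type": "google_storage_bucket", "name": "assets", "instances": []},
        {"mode": "data", "type": "google_project", "name": "current", "instances": []},
    ],
})

print(missing_imports(
    ["google_storage_bucket.assets", "google_compute_network.vpc"],
    SAMPLE_STATE,
))  # ['google_compute_network.vpc']
```

Each address left in the list is one more terraform import to run, followed by a terraform plan to confirm there is no drift.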

Q: How do you manage Terraform in a team?
A: Use remote backend (Terraform Cloud, S3/GCS), state locking, separate workspaces/environments, modules for reuse, CI pipeline to run plan on PRs and apply through an authorized pipeline, and store sensitive values in secret managers.

5 — CI/CD

Q: Describe a CI/CD pipeline you implemented.
A (structure): Source (Git) → Lint & Unit tests → Build image → Static analysis & integration tests → Push image to registry → Deploy to staging via IaC/helm → Integration/smoke tests → Manual approval → Blue/Green or Canary deploy to production → Post-deploy smoke checks and monitoring. Use Cloud Build / GitHub Actions / Jenkins as the orchestrator, and configure rollbacks up front (kubectl rollout undo for workloads; for infra changes, revert the Terraform config in Git and re-apply, since Terraform has no rollback command).
Tip: Provide a concrete metric: build time, rollback time, or reduced deployment incidents.

Q: Rollback strategies?
A: Roll back via Kubernetes (kubectl rollout undo deployment/<name>), use immutable image tags and switch traffic (Blue/Green), or shift traffic gradually with a Canary using a service mesh (Istio/Linkerd) or load balancer. For infra, Terraform has no built-in rollback: revert to the previous configuration in version control and apply it, and rely on snapshots/backups for stateful resources. Always automate health checks and canary analysis before the full switch.
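
"Canary analysis" can sound abstract, so here is one way the promote-or-rollback decision might be sketched. It assumes you can read request and error counts for the baseline and the canary from your metrics system. The thresholds (1.5x the baseline error rate, at least 100 canary requests, a 0.1% floor) are arbitrary illustrative values, not recommendations from any tool.

```python
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 1.5, min_requests: int = 100) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary release."""
    if canary_total < min_requests:
        return "wait"  # not enough traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Small absolute floor so a near-zero baseline does not fail every canary.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"


print(canary_verdict(10, 10_000, 1, 1_000))   # promote
print(canary_verdict(10, 10_000, 30, 1_000))  # rollback
print(canary_verdict(10, 10_000, 0, 50))      # wait
```

Real canary tooling usually compares several metrics (latency, saturation) over time windows, but the shape of the decision is the same.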

6 — Cloud (GCP / AWS / Azure)

Q: How do you secure secrets in cloud?
A: Use cloud secret managers (GCP Secret Manager / AWS Secrets Manager), restrict access via IAM/service accounts with least privilege, audit access logs, and deliver secrets to pods securely (Secrets Store CSI driver, or Kubernetes Secrets exposed as volumes or environment variables). Avoid hard-coding secrets in code or Terraform. Rotate secrets and use HSM-backed KMS keys for high-sensitivity material.

Q: Cost optimization strategies?
A: Right-sizing instances, using spot/preemptible VMs for non-critical workloads, autoscaling, reserved instances/savings plans for steady-state workloads, lifecycle policies for storage (cold storage), monitoring cost with budgets & alerts.

7 — Monitoring, Logging & Alerts

Q: How do you detect degraded performance in production?
A: Use a combination: application metrics (latency, error rate), infrastructure metrics (CPU, memory, disk IO), logs (structured logs to a central system), traces (distributed tracing), and alerts configured on SLO/SLA thresholds. For K8s, monitor pod CPU and memory usage against requests/limits, node pressure conditions, and events. Use Prometheus + Alertmanager + Grafana or cloud native tools (Cloud Monitoring) and set on-call notifications (PagerDuty/Slack/Email).
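
If the interviewer digs into SLO thresholds, it helps to have the arithmetic ready. The sketch below assumes an availability SLO expressed as a fraction (0.999 for "three nines") over a 30-day window, and uses the common idea of a burn rate, where 1.0 means the error budget is being spent exactly on schedule. The function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)


print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")  # 43.2 minutes per 30 days
print(f"burn rate at 1% errors: {burn_rate(0.01, 0.999):.0f}x")  # burn rate at 1% errors: 10x
```

A 10x burn rate means a month's budget would be gone in about three days, which is the kind of threshold worth paging on.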

Q: Example incident and resolution (STAR)?
A (STAR format):

Situation: Production payments service latency spiked, users couldn’t complete transactions.
Task: Identify root cause and restore service quickly.
Action: Checked Grafana dashboards → high CPU and memory on specific pods. kubectl describe pod showed containers restarting with reason OOMKilled. Logs pointed to a memory leak in the new version. Rolled back to the previous revision (kubectl rollout undo), scaled replicas temporarily, tuned resource requests/limits, then shipped a hotfix for the leak to staging and on to production during low traffic. Updated CI to include memory profiling tests.
Result: Service restored within 25 minutes, impact limited to 5% of users, and recurrence prevented by tests and resource limits.

8 — Security & Compliance

Q: Why not store secrets in ConfigMaps?
A: ConfigMaps are not encrypted and intended for non-sensitive configuration. Secrets should be stored in dedicated secret managers or Kubernetes Secrets with encryption at rest enabled and RBAC restricted. Use Secret Manager + workload identity or CSI providers for better security and rotation.
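
A quick way to show why "encryption at rest enabled" matters: values in a Kubernetes Secret manifest are base64-encoded, and base64 is an encoding, not encryption. The snippet below uses a made-up password to show that anyone who can read the encoded value can reverse it without a key.

```python
import base64

# Hypothetical value as it would appear under `data:` in a Secret manifest.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")
print(encoded)

# Decoding needs no key: read access to the object is all it takes.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

This is why RBAC on Secrets and etcd encryption at rest are the real controls, with an external secret manager handling rotation.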

Q: How to secure a Kubernetes cluster?
A: Use RBAC with least privilege, enable network policies, enforce Pod Security Admission or a policy engine such as OPA Gatekeeper (PodSecurityPolicy was removed in Kubernetes 1.25), use private clusters, enable control plane and audit logging, scan images regularly, use signed images (cosign), and run containers as non-root.

9 — Git & Collaboration

Q: How do you handle merge conflicts and branching?
A: Use feature branches, small PRs, code reviews, and CI checks. Resolve conflicts locally, rebase the feature branch on the target branch before merging if needed, keep commits clean, and run the unit tests locally before pushing.

Q: How to show initiative in interviews (communication)?
A: Use STAR stories, emphasize impact (users affected, downtime reduced, time saved), and phrase contributions proactively: “I led…”, “I implemented…”, “I improved…”, “I collaborated with…”.

10 — Common Troubleshooting Questions

Q: How do you debug a high CPU issue on a node?
A: 1) Check node metrics (top/htop, kubectl top nodes), 2) List pods consuming CPU (kubectl top pods --all-namespaces), 3) kubectl describe node and kubectl describe pod for events, 4) Inspect application logs, 5) Scale or evict pods, 6) Add resource limits/requests, and fix root cause in app code or do a rolling update.
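
Step 2 often means scanning a long table by eye. As a rough sketch, the output of kubectl top pods --all-namespaces can be ranked by CPU with a few lines of parsing. The sample text below is hypothetical but follows the usual four-column layout, and the helper names are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical sample in the usual column layout of `kubectl top pods -A`.
SAMPLE = """\
NAMESPACE     NAME          CPU(cores)   MEMORY(bytes)
default       api-7d9f      850m         512Mi
default       worker-5c4b   1200m        900Mi
kube-system   coredns-abc   3m           20Mi
"""


def cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as '850m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def top_cpu_pods(top_output: str, limit: int = 2) -> List[Tuple[str, str, int]]:
    """Return (namespace, pod, millicores) for the heaviest CPU consumers."""
    rows = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        namespace, name, cpu, _memory = line.split()
        rows.append((namespace, name, cpu_millicores(cpu)))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


for namespace, pod, millicores in top_cpu_pods(SAMPLE):
    print(f"{namespace}/{pod}: {millicores}m")
```

From there, kubectl describe pod on the top entries tells you whether they are running without limits or simply under-provisioned.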

Q: How do you debug failed Kubernetes scheduling?
A: Check kubectl describe pod for scheduling errors (insufficient CPU/memory, node selectors/taints/tolerations), inspect node capacity, check taints, check resource quotas and limit ranges in the namespace.
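
The checks above mirror what the scheduler does when it filters nodes. Here is a deliberately simplified model, assuming taints and tolerations are plain strings and that only CPU and memory requests matter; the real scheduler also evaluates affinity, topology, volumes, and toleration operators. The Node and Pod classes and the reason strings are illustrative, not Kubernetes API types.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    free_cpu_m: int    # allocatable CPU minus existing requests, in millicores
    free_mem_mi: int   # allocatable memory minus existing requests, in MiB
    taints: Set[str] = field(default_factory=set)


@dataclass
class Pod:
    cpu_request_m: int
    mem_request_mi: int
    tolerations: Set[str] = field(default_factory=set)


def why_unschedulable(pod: Pod, node: Node) -> List[str]:
    """List the reasons this pod cannot be placed on this node (empty = fits)."""
    reasons = []
    if pod.cpu_request_m > node.free_cpu_m:
        reasons.append("insufficient cpu")
    if pod.mem_request_mi > node.free_mem_mi:
        reasons.append("insufficient memory")
    untolerated = node.taints - pod.tolerations
    if untolerated:
        reasons.append(f"untolerated taints: {sorted(untolerated)}")
    return reasons


node = Node("node-1", free_cpu_m=500, free_mem_mi=2048, taints={"dedicated=gpu:NoSchedule"})
pod = Pod(cpu_request_m=1000, mem_request_mi=512)
print(why_unschedulable(pod, node))
```

A pod stays Pending only when every node returns at least one reason, which is why the events on kubectl describe pod summarize counts per reason across nodes.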

11 — Behavioral / Managerial Questions

Q: Describe a time you handled a production incident.
A: Use STAR — explain impact, your specific actions, collaboration, and outcome. Emphasize metrics (recovery time, % users affected) and follow-up (postmortem, long-term fix). Always show ownership and what you learned.

Q: How do you respond when you don’t know an answer?
A: “I’d explain the steps I’d take to find the answer: gather logs/metrics, reproduce in staging, consult documentation and teammates, and escalate if needed. For example, if I don’t know how an obscure API behaves, I’d test it in a sandbox and document results.”
Tip: Never say only “I don’t know.” Show investigation plan.

12 — Behavioral Tips & Common Pitfalls

Speak slowly and clearly. Pause before answers to structure thoughts.
Use STAR for behavioral answers. Have 2–3 impactful incidents ready (production incident, automation you led, cross-team collaboration).
Quantify impact. “Reduced deployment time from 20 to 5 minutes,” or “mitigated outage affecting 10k users.”
Avoid “I don’t know.” Replace with “I haven’t faced that exact scenario, but here’s how I’d approach it.”
Show proactivity. Say “I picked this up from backlog” vs “I was assigned tickets.”
Be honest about limits but follow with a clear investigation plan.

13 — Quick-fire Interview Q&A (Short answers to rehearse)

What is kubectl rollout undo used for? — Rollback a deployment to a previous revision.
What is terraform plan? — Shows proposed changes without applying them.
Why use readiness probe? — To keep unhealthy pods out of service until ready.
What is container orchestration? — Automation of deployment, scaling, and management of containers (Kubernetes).
When to use blue/green vs canary? — Blue/Green is full switch; Canary is gradual rollout for controlled exposure and metrics analysis.

30 DevOps Mock Interview Flashcards

Linux & Shell Scripting

1️⃣ Q: How do you find top memory-consuming processes in Linux?
A: Use top or ps aux --sort=-%mem | head.

2️⃣ Q: How do you monitor a specific service continuously?
A: Use a while loop with systemctl is-active <service> or watch -n 2 systemctl status <service>.

3️⃣ Q: What’s the difference between a hard link and a soft link?
A: Hard link = another name for the same inode (cannot link directories or cross filesystems). Soft link = a pointer to a file path (can link directories and cross filesystems, but breaks if the target is removed).

4️⃣ Q: How do you schedule a cron job for every 5 minutes?
A: */5 * * * * /path/to/script.sh
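
If asked what */5 actually means, you can describe it as "every minute whose value is divisible by 5". The sketch below models only the minute field and supports just *, */N, and a single number; real cron also accepts lists and ranges. The minute_matches helper is hypothetical.

```python
def minute_matches(field: str, minute: int) -> bool:
    """Check a minute (0-59) against a simplified cron minute field."""
    if field == "*":
        return True
    if field.startswith("*/"):
        step = int(field[2:])
        return minute % step == 0  # steps are counted from minute 0
    return minute == int(field)


fire_minutes = [m for m in range(60) if minute_matches("*/5", m)]
print(fire_minutes)  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```

One consequence worth knowing: the schedule is anchored to the clock, so the job fires at :00, :05, :10 and so on, not five minutes after it was installed.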

Docker

5️⃣ Q: What’s the difference between Docker image and container?
A: Image = blueprint. Container = running instance of that blueprint.

6️⃣ Q: How do you persist data in Docker?
A: Use Docker Volumes or Bind Mounts.

7️⃣ Q: How do you inspect environment variables of a running container?
A: docker exec <container_id> env

8️⃣ Q: How do you reduce Docker image size?
A: Use smaller base images (alpine), combine RUN commands, and multi-stage builds.

CI/CD (Jenkins / GitHub Actions)

9️⃣ Q: What is the difference between Declarative and Scripted pipelines in Jenkins?
A: Declarative = structured syntax, easier to maintain. Scripted = full Groovy flexibility.

🔟 Q: How do you trigger a Jenkins pipeline automatically on code push?
A: By configuring webhooks from GitHub/GitLab to Jenkins.

1️⃣1️⃣ Q: What’s the purpose of the post block in Jenkins?
A: Defines steps that run after the pipeline or a stage completes, selected by condition (always, success, failure, and so on).

1️⃣2️⃣ Q: What’s Blue-Green deployment?
A: Two environments (Blue=live, Green=new). Once tested, switch traffic from Blue to Green.

Cloud (AWS / GCP / Azure)

1️⃣3️⃣ Q: What’s the difference between public and private subnet?
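A: Public subnet = its route table has a route to an Internet Gateway, so instances with public IPs are reachable from the internet; private subnet = no direct internet route, with outbound traffic going through a NAT gateway.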

1️⃣4️⃣ Q: What’s IAM used for?
A: Identity & Access Management — controls permissions and roles for cloud users/services.

1️⃣5️⃣ Q: What’s an S3 bucket?
A: A container for objects in Amazon S3, AWS's object storage service; used for storing data, logs, or hosting static websites.

1️⃣6️⃣ Q: How do you secure access to an EC2 instance?
A: Use SSH keys, security groups, and avoid exposing port 22 publicly (use bastion host).

Terraform

1️⃣7️⃣ Q: What is a Terraform provider?
A: Plugin that lets Terraform interact with specific APIs (AWS, GCP, Docker, etc.).

1️⃣8️⃣ Q: How do you manage secrets in Terraform?
A: Use environment variables, Vault integration, or external secret managers — not in plain text.

1️⃣9️⃣ Q: What’s terraform init?
A: Initializes a directory, downloads provider plugins, and prepares the backend.

2️⃣0️⃣ Q: What is the purpose of terraform plan?
A: Shows the execution plan before applying changes.

Kubernetes

2️⃣1️⃣ Q: What’s the smallest deployable unit in Kubernetes?
A: Pod.

2️⃣2️⃣ Q: How does Kubernetes achieve self-healing?
A: Controllers recreate failed pods automatically to match the desired state.

2️⃣3️⃣ Q: Difference between ConfigMap and Secret?
A: ConfigMap = non-sensitive config data. Secret = sensitive data; values are only base64-encoded by default, so enable encryption at rest and restrict access with RBAC.

2️⃣4️⃣ Q: What’s a readiness probe?
A: Checks if a container is ready to receive traffic.

Monitoring & Observability

2️⃣5️⃣ Q: What are the 3 pillars of observability?
A: Metrics, Logs, Traces.

2️⃣6️⃣ Q: How does Prometheus collect data?
A: Pulls metrics from instrumented endpoints (via HTTP).

2️⃣7️⃣ Q: What tool do you use for visualization?
A: Grafana (connects to Prometheus/Elasticsearch for dashboards).

Git & Collaboration

2️⃣8️⃣ Q: How do you undo the last commit without losing changes?
A: git reset --soft HEAD~1

2️⃣9️⃣ Q: How do you resolve merge conflicts?
A: Pull latest, fix conflicts manually, stage, commit, and push.

3️⃣0️⃣ Q: What’s Git rebase used for?
A: To reapply commits on top of another branch, creating a cleaner history.

Feedback Summary

Here’s what the interviewer summarized for Rohit — lessons that apply to many DevOps engineers:

Technical strength is good, but soft skills and storytelling matter in managerial rounds.
Slow down your pace — clarity beats speed.
Own your projects. Use “I led” or “I implemented,” not just “I was assigned.”
Create larger-than-life scenarios. Show the impact of your actions, not just the task.
Be proactive in problem-solving. Always explain how you’d find a solution when you don’t know something.

Example for reference 👇
Situation: Jenkins pipeline kept failing during Docker image push.
Task: Automate image tagging and fix credentials issue.
Action: Debugged Jenkins logs, fixed Docker login in credentials plugin, added dynamic tagging logic using GIT_COMMIT.
Result: CI pipeline became fully automated and reduced build failures by 80%.
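
The "dynamic tagging logic using GIT_COMMIT" in this example could look roughly like the sketch below. It assumes the pipeline passes in the full commit hash (in Jenkins, typically read from the GIT_COMMIT environment variable) and tags the image with its first seven characters. The registry name and the commit hash are made up for illustration.

```python
import re


def image_tag(repository: str, commit: str, short_len: int = 7) -> str:
    """Build 'repository:<short-sha>' from a Git commit hash."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError(f"not a Git commit hash: {commit!r}")
    return f"{repository}:{commit[:short_len]}"


# In a pipeline the hash would come from os.environ["GIT_COMMIT"];
# a literal is used here so the example runs anywhere.
commit = "3f2a9c1d8e7b6a5f4c3d2e1f0a9b8c7d6e5f4a3b"
print(image_tag("registry.example.com/team/app", commit))
# registry.example.com/team/app:3f2a9c1
```

Tagging by commit keeps every image traceable to source and makes rollbacks a matter of redeploying a known tag rather than rebuilding.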

The Modern Stack by Amit
