Job Description:
Role Summary:
We are looking for a hands-on engineer responsible for monitoring, troubleshooting, and ensuring the availability of cloud and on-prem infrastructure across Azure, AWS, Windows, and Linux environments. The role requires proactive monitoring, incident response, and strong root cause analysis skills.
Key Responsibilities:
Cloud Monitoring (Azure & AWS):
Monitor cloud infrastructure using tools like Azure Monitor and Amazon CloudWatch
Configure alerts, dashboards, and health checks
Analyze metrics (CPU, memory, disk, network) and respond to anomalies
Troubleshoot VM, storage, and networking issues in cloud environments
Track cost-impacting anomalies (overutilization, idle resources)
Identity & Access Monitoring (Microsoft Entra ID):
Manage and monitor Microsoft Entra ID (formerly Azure AD)
Investigate login failures, risky sign-ins, and MFA issues
Support Conditional Access policies and identity security
Handle user access issues, lockouts, and permissions troubleshooting
Server Monitoring (Windows & Linux):
Monitor health and performance of:
Windows Server environments
Linux systems
Troubleshoot:
High CPU/memory usage
Disk space issues
Service failures
Perform log analysis and root cause identification
Incident Management & Troubleshooting:
Respond to alerts and incidents within SLA timelines
Perform root cause analysis (RCA) and document findings
Coordinate with application, network, and security teams
Maintain incident reports and resolution documentation
Proactive Monitoring & Optimization:
Identify recurring issues and implement preventive fixes
Fine-tune monitoring alerts to reduce noise
Automate routine checks using scripts (PowerShell/Bash)
Required Skills & Experience:
Core Technical Skills:
Hands-on experience with:
Microsoft Azure
Amazon Web Services
Strong knowledge of:
VM troubleshooting (boot issues, performance, connectivity)
Storage (disks, IOPS, latency issues)
Networking basics (DNS, routing, firewall concepts)
Monitoring & Tools:
Experience with:
Azure Monitor / Log Analytics
Amazon CloudWatch
Understanding of alerts, metrics, logs, and dashboards
Operating Systems:
Strong troubleshooting skills in:
Windows Server
Linux
Scripting & Automation:
Basic scripting skills:
PowerShell (Windows/Azure)
Bash (Linux)
Preferred Qualifications:
Certifications such as:
AZ-104 (Microsoft Azure Administrator)
AWS Certified SysOps Administrator - Associate
Experience with ticketing tools (ServiceNow, Jira)
Exposure to backup, DR, and patch management
Soft Skills:
Strong troubleshooting mindset (not just monitoring)
Ability to work under pressure during incidents
Clear communication and documentation skills