High Availability — ERP Uptime Architectures
High availability (HA) describes ERP architectures engineered to remain operational despite component failures such as server crashes, disk failures, network outages and, in the most resilient designs, the loss of a data centre. HA configurations target uptime of 99.9% (about 43 minutes of downtime per 30-day month), 99.99% (about 4 minutes per month) or higher. For mid-market organisations running ERP, the right HA level depends on the business impact of downtime, regulatory pressure and the cost of higher-tier infrastructure.
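The downtime figures quoted for each availability level follow from simple arithmetic. The sketch below shows the calculation, assuming a 30-day month; the function name `downtime_minutes` is illustrative and not taken from any ERP product.

```python
def downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime budget in minutes for a given availability target.

    Assumes the period is measured in whole days (30 by default) and that
    planned maintenance counts against the budget, which a real SLA may
    define differently.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


# Print the monthly downtime budget for a few common targets.
for pct in (99.5, 99.9, 99.99):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} minutes per 30-day month")
```

Passing `period_days=365` gives the annual budget, which is often the more useful figure when comparing vendor SLAs.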
HA architecture patterns
Four patterns dominate ERP HA design. (1) Cold standby: a backup server is available but not running; if the primary fails, the standby is started manually. Recovery time: 30-180 minutes. Cheapest, lowest availability. (2) Warm standby: the backup server runs with recently replicated data; failover is scripted but not automatic. Recovery time: 5-30 minutes. (3) Hot standby (active-passive): the backup server runs continuously with data replicated synchronously, and failover is triggered automatically by health checks. Recovery time: 30-300 seconds. Standard for mid-market on-premises ERP. (4) Active-active: multiple servers handle load simultaneously across data centres; the loss of one centre is transparent to users. Recovery time: near-zero. Required for mission-critical operations but expensive to design and operate correctly.
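The recovery-time ranges above can be turned into a rough selection aid. The following sketch picks the cheapest pattern whose worst-case recovery time still fits a given tolerance. The cost ordering of the four patterns and the treatment of "near-zero" as zero seconds are simplifying assumptions, and `cheapest_pattern_for_rto` is a hypothetical helper.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    name: str
    worst_case_recovery_s: int  # upper end of the recovery range quoted above


# Listed from cheapest to most expensive (ordering is an assumption).
PATTERNS = [
    Pattern("cold standby", 180 * 60),
    Pattern("warm standby", 30 * 60),
    Pattern("hot standby (active-passive)", 300),
    Pattern("active-active", 0),  # "near-zero" approximated as 0 seconds
]


def cheapest_pattern_for_rto(tolerable_outage_s: int) -> str:
    """Return the cheapest pattern whose worst-case recovery fits the tolerance."""
    if tolerable_outage_s < 0:
        raise ValueError("tolerable_outage_s must be non-negative")
    for pattern in PATTERNS:
        if pattern.worst_case_recovery_s <= tolerable_outage_s:
            return pattern.name
    return PATTERNS[-1].name  # unreachable while active-active is 0 seconds


print(cheapest_pattern_for_rto(10 * 60))  # a business that tolerates 10 minutes
```

Using the worst-case end of each range is deliberately conservative; a real decision would also weigh cost and operational maturity.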
HA in cloud ERP
Modern cloud ERP (SAP S/4HANA Cloud, Microsoft Dynamics 365 F&O, Oracle Cloud ERP, NetSuite) delivers HA as a default of the SaaS architecture: multiple availability zones, automatic failover and transparent recovery from infrastructure incidents. Published SLAs typically range from 99.5% to 99.9%. Customer-side responsibility narrows to maintaining the application-level configuration, managing custom integrations and extensions, and accepting the vendor's maintenance-window schedule. The hard work of HA architecture becomes the vendor's operational responsibility, which is one of cloud ERP's most valuable but least-marketed benefits.
On-premises HA design
On-premises ERP HA requires deliberate architecture and operational discipline. Database layer: synchronous replication (SQL Server Always On availability groups, Oracle Data Guard, PostgreSQL streaming replication in synchronous mode, SAP HANA System Replication), shared-nothing clustering, or shared-storage clustering. Application layer: multiple application server instances behind a load balancer, with session state externalised or sticky-session routing. Storage: SAN or hyperconverged infrastructure (VMware vSAN, Nutanix, Dell VxRail) with redundant paths. Network: redundant switches, redundant ISP links, redundant firewalls. Monitoring: continuous health checks with automated alerting and failover triggers. Cost: HA roughly doubles ERP infrastructure cost compared with a single-server deployment and adds 20-40% to annual operations expense.
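The monitoring layer's failover trigger can be illustrated with a small state machine. This is a minimal sketch, assuming failover fires after a fixed number of consecutive failed health checks and that failback to the primary is a manual step; the class `FailoverMonitor` and its threshold of three are hypothetical and do not reflect any specific cluster manager.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3      # consecutive failures before failover (assumed)
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the node now serving traffic."""
        if self.active != "primary":
            # Already failed over; failback is left to operators in this sketch.
            return self.active
        if primary_healthy:
            # A single success resets the counter, so brief blips do not fail over.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active


monitor = FailoverMonitor()
for healthy in (False, False, True, False, False, False):
    print(healthy, "->", monitor.record(healthy))
```

Requiring several consecutive failures trades a slightly longer detection time for fewer spurious failovers; the check interval multiplied by the threshold is a lower bound on recovery time.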
Disaster recovery versus HA
HA protects mainly against component failures, usually within a single data centre; DR (Disaster Recovery) protects against the loss of an entire site. Key DR metrics: RPO (Recovery Point Objective), how much data loss is acceptable, measured in minutes or hours; and RTO (Recovery Time Objective), how quickly operations must resume. For mid-market ERP, typical DR targets are RPO 15-60 minutes (asynchronous replication to a secondary site) and RTO 2-8 hours (scripted failover with validation steps). Mission-critical operations target RPO < 5 minutes and RTO < 1 hour, which requires synchronous or near-synchronous replication to a secondary site. This is expensive but increasingly affordable with cloud DR services (Azure Site Recovery, AWS Elastic Disaster Recovery).
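RPO and RTO become concrete when measured against an actual drill or incident. The sketch below compares observed data loss and recovery time with a pair of targets. The type names, the inclusive comparison and the sample drill figures are all illustrative assumptions; the target values use the loosest end of the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTargets:
    rpo_minutes: float  # maximum acceptable data loss
    rto_minutes: float  # maximum acceptable time to resume operations


@dataclass(frozen=True)
class DrillResult:
    data_loss_minutes: float  # age of the last replicated transaction at failure
    recovery_minutes: float   # time from failure until operations resumed


def evaluate_drill(targets: DrTargets, result: DrillResult) -> dict[str, bool]:
    """Report whether each objective was met (inclusive comparison, an assumption)."""
    return {
        "rpo_met": result.data_loss_minutes <= targets.rpo_minutes,
        "rto_met": result.recovery_minutes <= targets.rto_minutes,
    }


MID_MARKET = DrTargets(rpo_minutes=60, rto_minutes=8 * 60)
MISSION_CRITICAL = DrTargets(rpo_minutes=5, rto_minutes=60)

# Hypothetical drill: 20 minutes of lost data, 2.5 hours to recover.
drill = DrillResult(data_loss_minutes=20, recovery_minutes=150)
print("mid-market:", evaluate_drill(MID_MARKET, drill))
print("mission-critical:", evaluate_drill(MISSION_CRITICAL, drill))
```

Recording these two numbers after every drill gives a simple trend line for whether the DR setup is keeping pace with its targets.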
Frequently Asked Questions
What HA level do I need for mid-market ERP?
For most mid-market operations, hot-standby (active-passive) HA targeting 99.9% availability is the right balance: affordable, well supported by vendors and partners, and sufficient for the typical business impact of downtime. Active-active is justified mostly for 24/7 manufacturing operations and high-volume e-commerce, where each minute of downtime costs measurable revenue.
Does cloud ERP eliminate the need for DR planning?
No. Cloud ERP delivers HA out of the box, but the customer remains responsible for DR aspects such as data exports, integration recovery and access during regional outages. Some major cloud ERP vendors offer multi-region failover as an optional premium tier. For mid-market organisations, regular data exports plus a documented runbook for prolonged outages typically suffice.
How often should we test DR?
Annual DR drills are the standard; twice-yearly drills are appropriate for mission-critical operations. Without testing, the documented procedures decay silently and reveal their failures only during actual disasters. Schedule the drill, involve the relevant teams, document the gaps, and fix them. The most consistent predictor of successful DR execution is recent DR practice.
