ETL — Extract, Transform, Load
ETL (Extract, Transform, Load) — and its modern variant ELT (Extract, Load, Transform) — describes the discipline of moving data from operational systems (ERP, CRM, e-commerce, marketing automation) into analytical systems (data warehouse, data lake, business-intelligence tools). For ERP-centric organisations, the ETL pipeline is the bridge between transactional operations and the analytical reporting layer that drives management decisions.
ETL versus ELT
Classical ETL extracts data from sources, transforms it on a dedicated transformation server, and loads the cleaned result into the data warehouse. Typical tools include Informatica PowerCenter, Microsoft SSIS, Oracle Data Integrator, IBM DataStage, Talend Open Studio and Pentaho Kettle. Modern ELT reverses the last two steps: extract, load raw data into a cloud data warehouse (BigQuery, Snowflake, Databricks, Redshift), then transform inside the warehouse using SQL and tools like dbt. ELT exploits the elastic compute of cloud warehouses, simplifies tooling and accelerates iteration. Since around 2020, most new data-integration projects in the DACH mid-market have defaulted to ELT; on-premises and legacy environments continue with classical ETL.
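A minimal sketch of the ELT order of operations in Python, using sqlite3 as a stand-in for the cloud warehouse and hard-coded records in place of a real extract; all table and column names here are illustrative. Raw data is landed untouched, then reshaped with SQL inside the warehouse, which is the step dbt would own in production.

    import json
    import sqlite3

    # Stand-in for a cloud warehouse connection (BigQuery, Snowflake, ...).
    conn = sqlite3.connect("warehouse.db")

    # Extract: in a real pipeline this would call the source system's API.
    raw_orders = [
        {"order_id": 1, "amount": "199.90", "status": "OPEN"},
        {"order_id": 2, "amount": "49.00", "status": "SHIPPED"},
    ]

    # Load: land the records as-is in a raw table, no transformation yet.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders (payload) VALUES (?)",
        [(json.dumps(r),) for r in raw_orders],
    )

    # Transform: reshape inside the warehouse with SQL (requires SQLite's
    # JSON functions, available in all recent builds).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders AS
        SELECT
            CAST(json_extract(payload, '$.order_id') AS INTEGER) AS order_id,
            CAST(json_extract(payload, '$.amount') AS REAL)      AS amount,
            json_extract(payload, '$.status')                    AS status
        FROM raw_orders
    """)
    conn.commit()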
Leading ETL/ELT platforms
Cloud-native ELT: Fivetran (managed connectors), Stitch, Airbyte (open source), Hevo Data; typically combined with dbt (data build tool) for in-warehouse transformations.
Enterprise ETL: Informatica Intelligent Cloud Services (IICS), Microsoft Azure Data Factory plus Synapse Pipelines, AWS Glue, Talend Cloud, Oracle Data Integrator Cloud, SAP Data Services and SAP Datasphere.
SAP-specific: SAP Business Warehouse extractors, SAP Datasphere (formerly Data Warehouse Cloud), SLT (SAP Landscape Transformation Replication Server).
Orchestration: Apache Airflow, Dagster, Prefect, Microsoft Data Factory pipelines.
Open source: Apache NiFi, Apache Spark, Apache Beam, Kafka Streams.
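To illustrate the orchestration layer, here is a minimal Airflow DAG sketch, assuming Airflow 2.4 or later; the DAG id, schedule and task bodies are placeholders. It simply wires an extract step ahead of a load step on a daily schedule.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull the delta from the ERP source")

    def load():
        print("land raw data in the warehouse")

    with DAG(
        dag_id="erp_daily_load",        # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        # Run the load only after the extract has succeeded.
        extract_task >> load_task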
Extracting from ERP systems
Each ERP exposes data through different mechanisms.
SAP S/4HANA: standard OData services, CDS Views, SAP Datasphere, SLT for replication.
Microsoft Dynamics 365 F&O and Business Central: BYOD (Bring Your Own Database), Dataverse, OData, Dual-write.
Oracle Cloud ERP: BICC (BI Cloud Connector), REST APIs, ADW (Autonomous Data Warehouse).
NetSuite: SuiteAnalytics Connect, SuiteTalk REST.
Mid-market (abas, proALPHA, Sage X3, weclapp): database-level access plus REST APIs.
Open source (Odoo, ERPNext): direct PostgreSQL access.
Choosing the right extraction method affects performance, governance and the cost of every subsequent change. Avoid direct database access in production — future ERP upgrades may break the assumed schema.
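As a concrete example of API-based extraction, the following Python sketch pages through an SAP OData v2 entity set with $top/$skip. The host name and credentials are hypothetical; SAP's standard API_SALES_ORDER_SRV service is used purely for illustration.

    import requests

    # Hypothetical S/4HANA host; the service and entity set follow SAP's
    # standard sales-order API.
    BASE = "https://s4.example.com/sap/opu/odata/sap/API_SALES_ORDER_SRV"
    ENTITY = "A_SalesOrder"

    def extract_odata(session, page_size=1000):
        """Page through an OData v2 entity set using $skip/$top."""
        skip = 0
        while True:
            resp = session.get(
                f"{BASE}/{ENTITY}",
                params={"$format": "json", "$top": page_size, "$skip": skip},
                timeout=60,
            )
            resp.raise_for_status()
            rows = resp.json()["d"]["results"]  # OData v2 JSON envelope
            if not rows:
                break
            yield from rows
            skip += page_size

    session = requests.Session()
    session.auth = ("EXTRACT_USER", "secret")  # use a technical user in practice
    for row in extract_odata(session):
        ...  # hand each record to the load step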
Practical patterns and pitfalls
Three guidelines for ETL pipelines around ERP:
(1) Capture delta, not full reload. Full daily reloads of large ERP tables hit performance limits quickly. Change-data-capture (CDC) via SLT, Debezium or vendor-specific tools is the standard for table sizes above 10 GB; a watermark-based sketch follows below.
(2) Preserve audit context. ERP transactions include user, timestamp and document references — these matter for compliance reporting and must travel through the pipeline.
(3) Document the semantic mapping. ERP table and column names are rarely self-descriptive (KNA1, BSEG, MSEG in SAP). Maintain a business-glossary mapping in tools like dbt or a data catalogue (Collibra, Atlan, Microsoft Purview) to keep downstream consumers productive. Within a few years, companies that skip this documentation accumulate pipelines that nobody can safely modify.
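A minimal Python sketch of guideline (1), delta extraction with a persisted high-water mark; the state database, table and column names are all hypothetical, and dedicated CDC tooling (SLT, Debezium) replaces this pattern at scale.

    import sqlite3
    from datetime import datetime, timezone

    # Hypothetical state store holding the pipeline's high-water marks.
    state = sqlite3.connect("etl_state.db")
    state.execute(
        "CREATE TABLE IF NOT EXISTS watermarks "
        "(table_name TEXT PRIMARY KEY, last_ts TEXT)"
    )

    def get_watermark(table):
        row = state.execute(
            "SELECT last_ts FROM watermarks WHERE table_name = ?", (table,)
        ).fetchone()
        return row[0] if row else "1970-01-01T00:00:00+00:00"

    def set_watermark(table, ts):
        state.execute(
            "INSERT INTO watermarks (table_name, last_ts) VALUES (?, ?) "
            "ON CONFLICT(table_name) DO UPDATE SET last_ts = excluded.last_ts",
            (table, ts),
        )
        state.commit()

    # Delta extract: only rows changed since the last run; the audit columns
    # (user, timestamp, document reference) travel along unchanged.
    since = get_watermark("sales_documents")
    query = (
        "SELECT doc_no, changed_by, changed_at, payload "
        "FROM sales_documents WHERE changed_at > ?"
    )
    # rows = source_conn.execute(query, (since,)) ... load rows, then advance.
    # In production, advance to the max changed_at actually extracted rather
    # than the wall clock, to avoid missing late-arriving rows.
    set_watermark("sales_documents", datetime.now(timezone.utc).isoformat())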
Frequently Asked Questions
Fivetran, Azure Data Factory, or build it ourselves?
Fivetran (or Stitch, Airbyte Cloud) trades higher subscription cost for lower engineering effort — a good fit for landscapes with fewer than about 50 sources covered by standard connectors. Azure Data Factory and AWS Glue trade engineering effort for lower per-source cost — good for heterogeneous landscapes with many custom sources. Building it yourself with Airflow plus custom Python rarely pays off below roughly 100 pipelines, and even at that scale mature open-source options usually cover the need.
How much ERP data should we extract?
Start with the master data plus the past 24 months of transactions. That covers most analytical needs without pulling massive history that no one will query. Add older history only when there is a specific use case — historical comparison, audit, regulatory reporting. Keep a clear retention policy for the data warehouse aligned with GDPR and tax-record requirements.
Is real-time ERP integration possible with ETL?
Near-real-time (5-30 second latency) is achievable with CDC plus streaming pipelines (Kafka, Kinesis, Event Hubs). True real-time (sub-second) is typically reserved for operational integration (API-to-API), not for ETL into analytics. Most analytics use cases are fine with 15-60 minute refresh intervals; pushing tighter rarely justifies the engineering cost.
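A minimal sketch of the consuming end of such a CDC stream, using the kafka-python client; the topic, broker and field names are hypothetical and assume Debezium-style change events.

    import json

    from kafka import KafkaConsumer  # kafka-python package

    # Hypothetical topic fed by a CDC tool such as Debezium.
    consumer = KafkaConsumer(
        "erp.sales_orders.changes",
        bootstrap_servers="broker:9092",
        group_id="warehouse-loader",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        change = message.value
        # Apply the change event to the warehouse staging table; end-to-end
        # latency is dominated by CDC capture plus consumer lag.
        print(change.get("op"), change.get("after"))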
