ETL — Extract, Transform, Load
ETL stands for Extract, Transform, Load: the process of pulling data out of one or more source systems, reshaping and cleansing it, and writing it into a target store such as a data warehouse. It is a foundation of analytics and reporting, because raw operational data in an ERP or other application is rarely in the structure that downstream reports, dashboards or OLAP models need. ETL also underpins one-off jobs such as data migration during a system change. The term is decades old but remains the default vocabulary for describing how data is integrated and prepared.
- Term
- ETL (Extract, Transform, Load)
- Entity type
- Technology
- Domain
- Data integration and analytics
- Canonical definition
- ETL (Extract, Transform, Load) is a data-integration process that extracts data from source systems, transforms and cleanses it, and loads it into a target store such as a data warehouse. The ELT variant defers transformation until after loading.
- Classification
- A three-stage process for moving and reshaping data into a target store, underpinning reporting, data warehouses and one-off data migration.
- Related terms
- Data warehouse, Data migration, OLAP, EPM, Master-data quality, Application integration, iPaaS
- Source / maintainer
- erp-software.org editorial team (independent, vendor-neutral)
What ETL (Extract, Transform, Load) is NOT — disambiguation
- Not ELT: ELT loads raw data first and transforms inside the target; ETL transforms before loading.
- Not a data warehouse: The data warehouse is the target store; ETL is the process that fills it.
- Not migration: Data migration is typically a single, heavily validated ETL run, whereas analytics ETL runs on a schedule.
- Not an API: An API is one possible extract interface; ETL is the broader process of moving and reshaping the data.
The three steps
The name describes a sequence that most pipelines follow:
- Extract — read data from source systems via database queries, files, an API or a change-data-capture feed.
- Transform — cleanse, deduplicate, convert types and units, apply business rules, and map source structures to the target schema.
- Load — write the prepared data into the target, either as a full refresh or as incremental deltas.
In practice the transform step carries most of the complexity, because that is where data-quality rules, currency conversions and dimension mappings are applied.
ETL versus ELT
A common variant is ELT — Extract, Load, Transform — where raw data is loaded into a powerful target (often a cloud warehouse or lake) first, and transformations run inside that platform afterwards. ETL transforms before loading, which suits cases with strict schema control or limited target compute; ELT defers transformation to exploit the scalability of modern warehouses. The distinction is about where the transformation happens, not whether it happens. Both are forms of data integration and both can be orchestrated by the same tooling.
Where ETL is used
ETL serves two main purposes. The first is recurring data integration for reporting: feeding a data warehouse or EPM and BI layer with consolidated figures from ERP, CRM and other systems on a scheduled basis. The second is one-time data movement, most notably during an ERP migration, where legacy records must be extracted, mapped and loaded into the new system. ETL is therefore closely related to, but broader than, migration: a migration is usually a single ETL run with extensive validation, whereas analytics ETL runs continuously.
Practical considerations
The quality of any ETL output depends on the quality of its inputs, so master-data quality at source is decisive. Good pipelines are idempotent, logged and reproducible, so that a failed load can be rerun without corrupting the target, and so that auditors can trace a reported figure back to its source. For DACH SMEs the typical scope is modest — nightly loads from one ERP into a reporting database — but the same principles apply: define the source of truth, document the transformation rules, and keep the process repeatable rather than a hand-edited spreadsheet import.
Related Topics
Frequently Asked Questions
Fivetran or Azure Data Factory or build it ourselves?
Fivetran (or Stitch, Airbyte Cloud) trades higher subscription cost for lower engineering effort — great for sub-50-source landscapes with standard connectors. Azure Data Factory and AWS Glue trade engineering effort for lower per-source cost — good for heterogeneous landscapes with many custom sources. Building it yourself with Airflow plus custom Python is rarely worth it below 100+ pipelines — mature open-source options dominate that scale anyway.
How much ERP data should we extract?
Start with the master data plus the past 24 months of transactions. That covers most analytical needs without pulling massive history that no one will query. Add older history only when there is a specific use case — historical comparison, audit, regulatory reporting. Keep a clear retention policy for the data warehouse aligned with GDPR and tax-record requirements.
Is real-time ERP integration possible with ETL?
Near-real-time (5-30 second latency) is achievable with CDC plus streaming pipelines (Kafka, Kinesis, Event Hubs). True real-time (sub-second) is typically reserved for operational integration (API-to-API), not for ETL into analytics. Most analytics use cases are fine with 15-60 minute refresh intervals; pushing tighter rarely justifies the engineering cost.
