- Written by Kevin Remmert
- Created: 01 February 2016
In past decades, data integration, when addressed well, was done with ETL tools and a Data Warehouse. In the cloud, people shy away from the Data Warehouse term: too limiting. Thought leaders profess new, extensible concepts like a "Data Lake". ETL tools are also somewhat scorned in the cloud: instead you need a "Data Pipeline". This may be because traditional ETL tools can be deployed to the cloud but seldom are, since moving them there offers no ROI benefit the way it does for other technologies.
If the data integration tools of the past are not being adapted to or adopted on the cloud, then is the secret data integration sauce really coming from these new Lake and Pipeline concepts? Not really. With Data Lakes, the data integration workload appears to be pushed or deferred to the analytics tools, where it is handled by super-savvy analysts and their spreadsheets or by super-crazy SQL statements a la Hive or Spark. Similar data integration complexity can be found in Data Pipelines, which implement business rules in Python or Java and skip the visual representation of a transformation.
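To make that last point concrete, here is a minimal sketch of what a hand-coded integration step in a pipeline might look like. Everything in it is hypothetical: the two sources (`crm_accounts`, `web_orders`), the field names, and the business rule that unmatched orders fall into an "unknown" segment are assumptions for illustration, not taken from any real pipeline.

```python
from collections import defaultdict

# Hypothetical extracts from two separate sources.
crm_accounts = [
    {"account_id": "A1", "segment": "enterprise"},
    {"account_id": "A2", "segment": "smb"},
]
web_orders = [
    {"account_id": "A1", "amount": 1200.0},
    {"account_id": "A1", "amount": 300.0},
    {"account_id": "A2", "amount": 80.0},
    {"account_id": "A3", "amount": 50.0},  # no matching CRM record
]


def integrate(accounts, orders):
    """Join orders to accounts and total revenue per customer segment."""
    segment_by_account = {a["account_id"]: a["segment"] for a in accounts}
    revenue = defaultdict(float)
    for order in orders:
        # Assumed business rule: orders without a CRM match go to "unknown".
        segment = segment_by_account.get(order["account_id"], "unknown")
        revenue[segment] += order["amount"]
    return dict(revenue)


result = integrate(crm_accounts, web_orders)
print(result)
```

The join and the business rule live only in the code. Nothing here gives an analyst the picture of the transformation that a traditional ETL tool would draw.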
So when I looked at what people with new cloud backend analytics platforms like Redshift or Hive and their Data Pipelines were doing for data integration, I was amused at what I discovered. Early cloud adopters of these technologies are doing less data integration across sources or cloud providers than I expected, given the level of data integration we saw before the cloud. This is similar to what we saw in the 90s, when Data Warehouses showed up on the scene before wide-scale ETL use. Specifically, in the 90s a common data warehouse paradigm was to just replicate the data model of an organization's financial system, achieving little or no data integration.
Similarly, customers doing analytics in the cloud today are often supporting analytics on a single, very large streamed data source: usually their website and/or mobile app. These cloud analytics platforms usually support simple time-series analytics as opposed to multi-dimensional, cross-subject-area data exploration. Other parallels to the 90s exist with these cloud pioneers, like the resemblance of today's NoSQL document stores to the VSAM OLTP databases of back then. In both decades the data is extracted to a relational platform for single-source analytics, without the multi-source data integration.
Two examples of frequent data integration in the cloud have been with SFDC and Google Analytics. In each case the heavy data integration lifting is occurring into Google and SFDC. Thinking back, this is somewhat like when ERP stormed the planet in the 2000s, prompting organizations to solve data integration by putting all of their corporate data in one place. The benefits of ERP were considerable; however, history suggests that solving the data integration challenge was not one of them.