Data puddles, ponds, lakes... and now swamps


I love Gartner, not so much for their research as for the great terms they create to describe technical challenges. It started with data silos (departments working independently with their own data sets), then we had data lakes to describe the vast quantity of data generated by businesses, and now we have data swamps. I must not forget ‘dark data’ either; perhaps that is a really muddy (data) swamp. Swamp is probably the better analogy, since the data lake is typically a repository of business data with little or no governance.

“Simply a data lake is an attempt to bring together physically a number (or all?) of the data stores available in such a way as they can be easily accessible to users (data scientists).  …a data lake does not support a single version or view of the truth.  It supports “any number of views of any number of truths”.  So it might be holistic in terms of data collection; but in terms of information and insight, it is fleeting and is more of a swamp.  There is value in a data lake, but it is not the same value you can get from an agreed, shared view of the truth.”

