Introduction
Data mesh has recently emerged as a new data management approach to producing business-ready data, but there is another term that is often associated with data mesh - data fabric. In this third conversation in the series of Data Automation Debates, we have covered the following topics:
- What is the difference between data mesh and data fabric?
- What is data fabric?
- Is data fabric an alternative to data mesh?
- Do I need a data fabric to implement a data mesh?
- Hints and tips
Speakers
Mike Ferguson
- CEO of Intelligent Business Strategies - an independent IT analyst, consulting, research and education company specializing in data management and analytics
- Chairman of Big Data London
Gregor Zeiler
- CEO of biGENIUS
Conversation
Mike Ferguson (referred to as "M" below)
Gregor Zeiler (referred to as "G" below)
G: Please give us a summary about data fabric - what is data fabric from your point of view.
M: Absolutely. Data fabric is a data management software platform that can connect to a broad range of different data stores, and even streaming data, across a distributed data estate. That could include data stores in multiple different clouds, on premises, in software-as-a-service applications, and external data sources, as well as streaming data. That is how we get the idea of what data fabric is about - being able to spread across this kind of distributed data landscape and reach all of the different data that companies now have in so many different places. So data fabric is a software platform that's capable of doing that, and there are some key characteristics of a data fabric software platform. It needs to be flexible and support automated deployment. In other words, it can run anywhere: on the different clouds you use, on premises, and potentially all the way out to the edge. So we need secure connectivity to that whole range of different data sources. In addition, data fabric has its own data catalog, or can connect to third-party data catalog products, in order to support the automatic discovery of data across the distributed data estate.
It also needs to support collaborative, augmented development of pipelines. So it's not just for the IT professional; it's role-based in that it can support both citizen data engineers and professional IT data engineers, and allow them to collaborate on the same team to build pipelines. It needs to support modern practices like DataOps, with CI/CD support, to allow you to quickly check components, automate testing, automate deployment if the testing passes successfully, and manage all of that on behalf of the organization. Of course, it needs to be able to scale to handle large volumes of data - structured, semi-structured and unstructured - and it needs to be extensible.
If you have some seriously complex transformations, or you need to run a machine learning model on unstructured data, or pull structured data out of text, then you should be able to do that. You should be able to either write code, or include code or a machine learning model from somewhere else, to process this kind of data. It should also include unified governance of that data, so the ability to govern the data across this distributed data landscape. And finally, it should have a kind of data marketplace to publish the data products that we spoke about in our last meeting - creating data products and publishing them in a data marketplace, so that people can shop for data, and also using the marketplace to govern data sharing around the organization. That means tracking it so that we can see that we're getting compliant use of data around the enterprise, and making sure that people request access from the owners of those data products if they're not immediately authorized to use them. In summary, data fabric is a comprehensive software platform for managing a distributed data estate that allows us to discover data, build pipelines, produce ready-made data and share it, as well as govern the data within that data estate.
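The marketplace behaviour described above (publish a data product, track who uses it, and route unauthorized users to the owner) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the Marketplace and DataProduct classes, their methods and the team names are hypothetical and do not correspond to any particular data fabric product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    owner: str
    authorized_users: set = field(default_factory=set)


class Marketplace:
    """Hypothetical in-memory marketplace; not a real product's API."""

    def __init__(self):
        self.products = {}
        self.access_log = []        # every access attempt, kept for compliance tracking
        self.pending_requests = []  # (user, product name, owner) awaiting owner approval

    def publish(self, product):
        self.products[product.name] = product

    def request_access(self, user, product_name):
        product = self.products[product_name]
        granted = user == product.owner or user in product.authorized_users
        self.access_log.append((user, product_name, granted))
        if not granted:
            # Not immediately authorized: route the request to the product owner.
            self.pending_requests.append((user, product_name, product.owner))
        return granted

    def approve(self, owner, user, product_name):
        product = self.products[product_name]
        if owner != product.owner:
            raise PermissionError("only the owner can approve access")
        product.authorized_users.add(user)
        self.pending_requests.remove((user, product_name, owner))


market = Marketplace()
market.publish(DataProduct("customer_orders", owner="sales_team"))
first_try = market.request_access("analyst_1", "customer_orders")
market.approve("sales_team", "analyst_1", "customer_orders")
second_try = market.request_access("analyst_1", "customer_orders")
```

In this sketch the first request is refused and queued for the owner, the second succeeds after approval, and both attempts remain in the access log.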
G: Can I imagine it to be a kind of a workbench for all the data engineers within a company, with all the needed tools inside?
M: I think what we're seeing is the need for this kind of project-oriented collaborative environment, perhaps with different role-based interfaces within it - so it is a workbench. But it may also use things like artificial intelligence to automate some of the tasks and make it easier for citizen data engineers to engineer data without having to do everything themselves: using AI to assist them to be more productive, perhaps to automate the mappings by using recommendations, and to take advantage of recommended transforms and things like that, in order to rapidly build these pipelines. They also need to be able to quickly check their work and trigger automatic testing, so that they don't have to deal with the complexities of doing that.
G: I've experienced in my discussions with companies that there is sometimes confusion around the topics of data fabric and data mesh. Can I understand [data fabric] as an alternative, or how can I manage both data fabric and data mesh?
M: I don't think they're alternatives. I think you can use data fabric to build the pipelines that create data products in a data mesh. So that's the way I look at data fabric. Of course, if you choose not to implement a data mesh, you can still use data fabric to construct other analytical systems. But clearly, yes, data fabric is a platform that you can use so that multiple teams of domain-oriented data engineers can work together and build data products that can be shared around the enterprise. So absolutely, data fabric is a platform that I can use to build data products in a data mesh, but it's not an alternative to a data mesh in my opinion.
G: Maybe I can extend that question with: do we need a data fabric to build a data mesh?
M: This is a very interesting question - if you didn't use [data fabric], what would be the alternative? The alternative is often best-of-breed tools, and that's a challenge, because multiple different teams or data engineers may use different tools, and we do not yet have an industry standard to share metadata across tools. With data fabric, we can share metadata across multiple teams and multiple data engineers, because they're all using the same platform. But best-of-breed tools make it much harder for different data engineers to understand who's building what, because they can't see the metadata across the different tools. So without data fabric it would require a lot more management of the whole program of development, and a lot more people involved, to make sure that we're reusing rather than reinventing. Also bear in mind you can do more with data fabric than just build a data mesh. You can also use it to govern your data within a data landscape. Data governance has become more and more important over the years, especially as we have more data, more places where the data is stored, and more legislation in different parts of the world. There is the GDPR in the European Union of course, but if you're a global company and you operate in different parts of the world, you have to worry about more than just the GDPR. You also need to know about the California Consumer Privacy Act (CCPA), or the Australian equivalent of this, or the Japanese equivalent of this... so we have to remain compliant with multiple different pieces of legislation in different parts of the world, and data fabric helps you to implement the governance program and govern the data within a data estate.
G: Can I summarize it like this - when I plan to do a data mesh, it's really helpful to think in the direction of data fabric, to standardize my whole workbench - the infrastructure and everything behind it, so that it's easier to govern all those data mesh data products within my company.
M: Yes, it can also help with the federated computational data governance that's talked about in data mesh, because you can share metadata across multiple teams within a data fabric. That makes the job much easier, because if I'm able to detect sensitive data with a data catalog as part of the data fabric, multiple teams can see that, and know that they have to protect that sensitive data when building data products. And so for me, it's not just, let's say, a modern name for ETL tools. Data fabric is far more than that: a much more comprehensive platform that is able to do much more than just support data engineering.
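As a rough illustration of the point about shared metadata, the following Python sketch shows two teams consulting the same catalog so that a column tagged as sensitive once is protected in every data product built from it. The SharedCatalog class, the build_data_product function, the masking rule and the dataset names are all hypothetical, chosen only to illustrate the idea rather than to describe how any specific data fabric or catalog works.

```python
class SharedCatalog:
    """Hypothetical shared metadata catalog; names are illustrative only."""

    def __init__(self):
        self._sensitive = {}  # dataset name -> set of sensitive column names

    def tag_sensitive(self, dataset, column):
        self._sensitive.setdefault(dataset, set()).add(column)

    def sensitive_columns(self, dataset):
        # Return a copy so callers cannot change the catalog by accident.
        return set(self._sensitive.get(dataset, set()))


def build_data_product(catalog, dataset, rows):
    """Mask every column that the shared catalog marks as sensitive."""
    sensitive = catalog.sensitive_columns(dataset)
    return [
        {col: ("***" if col in sensitive else val) for col, val in row.items()}
        for row in rows
    ]


catalog = SharedCatalog()
catalog.tag_sensitive("customers", "email")  # detected once, visible to all teams

# Two teams build different data products from the same source dataset.
team_a_product = build_data_product(
    catalog, "customers", [{"id": 1, "email": "a@example.com", "country": "AT"}]
)
team_b_product = build_data_product(
    catalog, "customers", [{"id": 2, "email": "b@example.com"}]
)
```

Because both teams read the same tags, neither has to rediscover that the email column needs protection.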
G: We should talk in more detail about data fabric in the future, but in the meantime maybe you can give us a few hints and tips on where we can start to learn more about data fabric, because it sounds very interesting to me as a success factor in implementing a data mesh.
M: I have a lot of education around both of these. First of all, Practical Guidelines for Implementing a Data Mesh is a two-day class, where I talk about data fabric as a key technology in helping build a data mesh. I also have another class on centralized governance of a distributed data landscape, which is really about how you can use the capabilities of data fabric to solve the governance problem - not just data quality, but also data sharing, data purity, data privacy, etc. are all in there. And of course, there is the book on data mesh from Zhamak Dehghani, which I think may have some discussion around this, and we can give you a link to that. There are also a number of good articles around data fabric that have been emerging. (Links listed below)
Full video
Links
Data Automation Debates series 1 with Mike Ferguson
Practical Guidelines for implementing a Data Mesh course
Centralised Data Governance of a Distributed Data Landscape course
The Value of Data Fabric in a Data-Driven Enterprise