Introduction: Treat data movement as a continuous, ever-changing operation and actively manage its performance.
Before big data and fast data, the challenge of data movement was simple: move fields from fairly static databases to an appropriate home in a data warehouse, or move data between databases and apps in a standardized fashion. The process resembled a factory assembly line.
In contrast, the emerging world is many-to-many: streaming, batch, or micro-batched data flows from numerous sources and is consumed by multiple applications. Big data processing operations resemble a city traffic grid, a network of shared resources, rather than the linear path taken by traditional data. In addition, the sources and applications are often controlled by separate parties, perhaps even third parties, so when schema or semantics inevitably change, a phenomenon known as data drift, the change can wreak havoc with downstream analysis.
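To make data drift concrete, here is a minimal sketch of a structural drift check: each incoming record is compared against the schema the pipeline was designed for, flagging missing fields, type changes, and unexpected new fields. The field names and types are hypothetical, not part of any particular product.

```python
# Hypothetical expected schema for an order record: field name -> expected type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

def detect_drift(record: dict) -> list[str]:
    """Return human-readable descriptions of how a record drifts from the expected schema."""
    issues = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            issues.append(f"type change: {field} is {actual}, expected {expected_type.__name__}")
    # Fields the upstream source added without warning.
    for field in record.keys() - EXPECTED_SCHEMA.keys():
        issues.append(f"new field: {field}")
    return issues

# An upstream source renamed "amount" and added a field; downstream logic
# that assumes the old schema would silently break without a check like this.
drifted = {"order_id": 7, "amount_usd": 9.99, "currency": "USD", "channel": "web"}
print(detect_drift(drifted))
```

In practice a data operations platform would perform this kind of inspection continuously at runtime rather than as a one-off validation step, which is precisely the operational perspective the next paragraph argues for.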
Because modern data is so dynamic, dealing with data in motion is not just a design-time problem for developers but also a runtime problem requiring an operational perspective, one that must be managed day to day and evolved over time. In this new world, organizations must architect for change and continually monitor and tune the performance of their data movement system.
StreamSets, the provider of the industry’s first data operations platform, offers the following 12 best practices as practical advice to help you manage the performance of data movement as a system and extract maximum value from your data.