How to choose a data analytics platform
- 14 July, 2020 07:04
Whether you have responsibilities in software development, devops, systems, clouds, test automation, site reliability, leading scrum teams, infosec, or other information technology areas, you’ll have increasing opportunities and requirements to work with data, analytics, and machine learning.
Your exposure to analytics may come through IT data, such as developing metrics and insights from agile, devops, or website metrics. There’s no better way to learn the basic skills and tools around data, analytics, and machine learning than to apply them to data that you know and that you can mine for insights to drive actions.
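As a rough illustration of mining IT data you already know, the sketch below derives two simple devops metrics from a list of deployment records. The record shape (a date plus a success flag), the function name `deployment_metrics`, and the sample values are all assumptions made for this example, not something a particular tool produces.

```python
from collections import Counter
from datetime import date


def deployment_metrics(deployments):
    """Summarise deployments given as (day, succeeded) tuples.

    Returns the number of deployments per ISO (year, week) and the
    share of deployments that failed.
    """
    per_week = Counter()
    failures = 0
    for day, succeeded in deployments:
        year, week, _ = day.isocalendar()
        per_week[(year, week)] += 1
        if not succeeded:
            failures += 1
    total = len(deployments)
    # Avoid dividing by zero when there is no data yet.
    failure_rate = failures / total if total else 0.0
    return {
        "deploys_per_week": dict(per_week),
        "change_failure_rate": failure_rate,
    }


# Hypothetical sample data, for illustration only.
sample = [
    (date(2020, 7, 6), True),
    (date(2020, 7, 8), False),
    (date(2020, 7, 14), True),
    (date(2020, 7, 15), True),
]
print(deployment_metrics(sample))
```

In practice the records would come from a CI/CD system or ticketing export rather than a hard-coded list, and the same aggregation could be done in a spreadsheet or a visualisation tool.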
Things get a little bit more complex once you branch out of the world of IT data and provide services to data scientist teams, citizen data scientists, and other business analysts performing data visualisations, analytics, and machine learning.
First, data has to be loaded and cleansed. Then, depending on the volume, variety, and velocity of the data, you’re likely to encounter multiple back-end databases and cloud data technologies.
Finally, over the past several years, what used to be a choice between business intelligence and data visualisation tools has ballooned into a complex matrix of full-lifecycle analytics and machine learning platforms.
The importance of analytics and machine learning increases IT’s responsibilities in several areas. For example, IT often provides services around all the data integrations, back-end databases, and analytics platforms.
Furthermore, devops teams often deploy and scale the data infrastructure to enable experimenting on machine learning models and then support production data processing, while network operations teams establish secure connections between SaaS analytics tools, multi-clouds, and data centres.
In addition, IT service management teams respond to data and analytics service requests and incidents; infosec oversees data security governance and implementations; and developers integrate analytics and machine learning models into applications.
Given the explosion of analytics, cloud data platforms, and machine learning capabilities, here is a primer to better understand the analytics lifecycle, from data integration and cleaning, to dataops and modelops, to the databases, data platforms, and analytics offerings themselves.
Analytics begins with data integration and data cleaning
Before analysts, citizen data scientists, or data science teams can perform analytics, the required data sources must be accessible to them in their data visualisation and analytics platforms.
To start, there may be business requirements to integrate data from multiple enterprise systems, extract data from SaaS applications, or stream data from IoT sensors and other real-time data sources.
These are all the steps to collect, load, and integrate data for analytics and machine learning. Depending on the complexity of the data and data quality issues, there are opportunities to get involved in dataops, data cataloging, master data management, and other data governance initiatives.
We all know the phrase, “garbage in, garbage out.” Analysts must be concerned about the quality of their data, and data scientists must be concerned about biases in their machine learning models.
Also, the timeliness of integrating new data is critical for businesses looking to become more real-time data-driven. For these reasons, the pipelines that load and process data are critically important in analytics and machine learning.
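A minimal sketch of the kind of check a pipeline might run on each batch before the data reaches analysts is shown here. It counts records with missing required fields, duplicate keys, and stale timestamps. The field names, the one-hour freshness threshold, and the function `quality_report` are hypothetical choices for this example; real pipelines typically rely on dedicated data quality tooling.

```python
from datetime import datetime, timedelta, timezone


def quality_report(records, required_fields, key_field, timestamp_field,
                   max_age, now=None):
    """Count basic quality problems in a batch of records (list of dicts).

    Timestamps are assumed to be timezone-aware datetime objects.
    """
    now = now or datetime.now(timezone.utc)
    seen_keys = set()
    missing = duplicates = stale = 0
    for rec in records:
        # A record is incomplete if any required field is absent or empty.
        if any(rec.get(field) in (None, "") for field in required_fields):
            missing += 1
        # A key seen earlier in the batch marks this record as a duplicate.
        key = rec.get(key_field)
        if key in seen_keys:
            duplicates += 1
        seen_keys.add(key)
        # A record is stale if it is older than the allowed maximum age.
        ts = rec.get(timestamp_field)
        if ts is not None and now - ts > max_age:
            stale += 1
    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "stale": stale,
    }


# Hypothetical IoT-style batch, for illustration only.
now = datetime.now(timezone.utc)
batch = [
    {"id": 1, "temp": 21.5, "ts": now - timedelta(minutes=5)},
    {"id": 2, "temp": None, "ts": now - timedelta(minutes=10)},
    {"id": 2, "temp": 19.0, "ts": now - timedelta(hours=3)},
]
print(quality_report(batch, ["temp"], "id", "ts", timedelta(hours=1), now=now))
```

A pipeline could fail the load or raise an alert when any of these counts crosses an agreed threshold, so quality and timeliness problems surface before they reach dashboards or models.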
Databases and data platforms for all types of data management challenges
Loading and processing data is a necessary first step, but then things get more complicated when selecting optimal databases. Today’s choices include enterprise data warehouses, data lakes, big data processing platforms, and specialised NoSQL, graph, key-value, document, and columnar databases.
To support large-scale data warehousing and analytics, there are platforms like Snowflake, Redshift, BigQuery, Vertica, and Greenplum. Lastly, there are the big data platforms, including Spark and Hadoop.
Large enterprises are likely to have multiple data repositories and to use cloud data platforms like Cloudera Data Platform or MapR Data Platform, or data orchestration platforms like InfoWorks DataFoundry, to make all of those repositories accessible for analytics.
The major public clouds, including AWS, GCP, and Azure, all have data management platforms and services to sift through.
For example, Azure Synapse Analytics is Microsoft’s SQL data warehouse in the cloud, while Azure Cosmos DB provides interfaces to many NoSQL data stores, including Cassandra (wide-column data), MongoDB (document data), and Gremlin (graph data).
Data lakes are popular loading docks to centralise unstructured data for quick analysis, and one can pick from Azure Data Lake, Amazon S3, or Google Cloud Storage to serve that purpose. For processing big data, the AWS, GCP, and Azure clouds all have Spark and Hadoop offerings as well.
Analytics platforms target machine learning and collaboration
With data loaded, cleansed, and stored, data scientists and analysts can begin performing analytics and machine learning. Organisations have many options depending on the types of analytics, the skills of the analytics team performing the work, and the structure of the underlying data.
Analytics can be performed in self-service data visualisation tools such as Tableau and Microsoft Power BI. Both of these tools target citizen data scientists and expose visualisations, calculations, and basic analytics.
These tools support basic data integration and data restructuring, but more complex data wrangling often happens before the analytics steps. Tableau Prep and Azure Data Factory are companion tools that help integrate and transform data.
Analytics teams that want to automate more than just data integration and prep can look to platforms like Alteryx Analytics Process Automation. This end-to-end, collaborative platform connects developers, analysts, citizen data scientists, and data scientists with workflow automation and self-service data processing, analytics, and machine learning processing capabilities.
Alan Jacobson, chief analytics and data officer at Alteryx, explains, “The emergence of analytic process automation (APA) as a category underscores a new expectation for every worker in an organisation to be a data worker. IT developers are no exception, and the extensibility of the Alteryx APA Platform is especially useful for these knowledge workers.”
There are several tools and platforms targeting data scientists that aim to make them more productive with technologies like Python and R while simplifying many of the operational and infrastructure steps. For example, Databricks is a data science operational platform that enables deploying algorithms to Apache Spark and TensorFlow, while self-managing the computing clusters on the AWS or Azure cloud.
Some platforms, like SAS Viya, now combine data preparation, analytics, forecasting, machine learning, text analytics, and machine learning model management into a single modelops platform. SAS is operationalising analytics and targets data scientists, business analysts, developers, and executives with an end-to-end collaborative platform.
David Duling, director of decision management research and development at SAS, says, “We see modelops as the practice of creating a repeatable, auditable pipeline of operations for deploying all analytics, including AI and ML models, into operational systems.
“As part of modelops, we can use modern devops practices for code management, testing, and monitoring. This helps improve the frequency and reliability of model deployment, which in turn enhances the agility of business processes built on these models.”
Dataiku is another platform that strives to bring data prep, analytics, and machine learning to growing data science teams and their collaborators. Dataiku has a visual programming model to enable collaboration and code notebooks for more advanced SQL and Python developers.
Other analytics and machine learning platforms from leading enterprise software vendors aim to bring analytics capabilities to data centre and cloud data sources. For example, Oracle Analytics Cloud and SAP Analytics Cloud both aim to centralise intelligence and automate insights to enable end-to-end decisions.
Choosing a data analytics platform
Selecting data integration, warehousing, and analytics tools used to be more straightforward before the rise of big data, machine learning, and data governance.
Today, there’s a blending of terminology, platform capabilities, operational requirements, governance needs, and targeted user personas that makes selecting platforms more complex, especially since many vendors support multiple usage paradigms.
Businesses differ in analytics requirements and needs but should seek new platforms from the vantage point of what is already in place. For example:
- Companies that have had success with citizen data science programs and that already have data visualisation tools in place may want to extend this program with analytics process automation or data prep technologies
- Enterprises that want a toolchain that enables data scientists working in different parts of the business may consider end-to-end analytics platforms with modelops capabilities
- Organisations with multiple, disparate back-end data platforms may benefit from cloud data platforms to catalog and centrally manage them
- Companies standardising all or most data capabilities on a single public cloud vendor ought to investigate the data integration, data management, and data analytics platforms offered
With analytics and machine learning becoming an important core competency, technologists should consider deepening their understanding of the available platforms and their capabilities. The power and value of analytics platforms will only increase, as will their influence throughout the enterprise.