
Mission AWS 2019: “Enable our Customers to Innovate”



The title of this post reflects the key message we took away from the AWS 2019 Summit in Paris: AWS innovates non-stop, but the real story is about helping its customers innovate.

In other words, business innovation by both private and public sector customers - transforming relationships with customers and citizens, making step-function jumps in operational efficiency, and inventing new products and services - all enabled by digital technologies.

As always, the yearly AWS Summit roadshows are a venue for customers to learn more about those technologies, but participating every year as analysts also helps us better understand how the world’s cloud computing leader is evolving in an IT market that continues to change in sometimes unexpected ways.

Our overall “impressions” from the 2019 AWS Summit

Sustainable Innovation

Business innovation depends today more than ever on technological innovation, and there was a lot of that on display at the Summit. In 2018, AWS introduced 1,957 new services or major new features for existing services, up from 1,430 in 2017, 1,017 in 2016… and 80 in 2011. The result: AWS is now the most functionally rich cloud on the market.

In this context, it’s worth understanding how they innovate. According to Adrian Cockcroft, VP Cloud Architecture Strategy, who delivered the morning keynote, much innovation is customer driven. The company listens to its customers and, if a request comes up that looks useful to a number of them, AWS will build it.

Of course, that’s not the whole story. As Jeff Bezos wrote in his latest letter to Amazon shareholders: “Much of what we build at AWS is based on listening to customers. … But it’s also not enough. The biggest needle movers will be things that customers don’t know to ask for. We must invent on their behalf. We have to tap into our own inner imagination about what’s possible.”

Combining “customer obsession” with engineers who “like to invent” – together with an R&D budget funded by a profitable business – is a good recipe for sustainable innovation.

An increasingly pragmatic market leader

In the past, we sometimes came away from AWS events with the impression – at some risk of exaggeration - of having heard a lot of “Tech ideology”: messages such as “everything will move to the cloud”, “data centers are doomed”, “the incumbent Tech vendors are irrelevant and will disappear” and, more recently, “serverless is the only way forward”. To be fair, Tech is a highly opinionated industry and AWS was far from alone in taking categorical positions.

That was then. This is now.

Over the last several years, the tone has changed. We started to see this in 2018 and it was clear at the 2019 Summit. AWS is still opinionated, but it has also become increasingly pragmatic, both in its messaging and in the variety of choices it brings to customers. Let’s mention two examples here:

  • Hybrid cloud

    AWS long scoffed at the notion of hybrid cloud. Nonetheless, as the need for “low network latency” applications grew, AWS put aside its earlier reticence and decided to deliver. An admirably pragmatic application of the celebrated line from John Maynard Keynes: "When the facts change, I change my opinion.”

    The company launched its first foray into hybrid in August 2017, delivering on an alliance with on-prem rival VMware to run the VMware Cloud stack on AWS infrastructure. In other words, VMware in the data center was extended up to the AWS cloud.

    A little over a year later came the announcement of AWS Outposts which, according to the company, “bring native AWS services, infrastructure, and operating models to virtually any data center…for a truly consistent and seamless hybrid cloud.” Outposts are managed as part of the public cloud, with a single console. In this case, AWS was extended down into the data center.

  • Containers versus Serverless

    Over the last several years, the future of cloud application deployment has been heatedly debated, especially as monolithic applications give way to modern microservices architectures. As the influential technologist and Pivotal Field CTO James Urquhart put it: “We are witnessing the death of the server (virtual or otherwise) being a unit of software packaging and deployment.”

    No one expects servers to disappear, but everyone knows that virtualizing an entire server is a highly inefficient way to deploy applications. And while VMs are faster to provision than physical servers, they still leave the developer with a number of low-value infrastructure concerns, costing time that would be better spent on business logic. The ongoing debate now basically comes down to two competing options:

    • Application Containers (like Docker), which virtualize the operating system rather than an entire server. Containers are lightweight and highly efficient for application deployment. Most are cloud agnostic and – given the necessary APIs to the target environment – can run almost anywhere. In production, they require tools such as Kubernetes (originally created and then open sourced by Google), which has become the de facto standard for container orchestration.

    • Serverless computing, which entirely “abstracts away” the underlying infrastructure so that developers can focus on business logic as “functions” and (of course) on how those functions interact in production as a complete application (a minimal sketch follows below).
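    To make the serverless option concrete, here is a minimal, purely illustrative sketch of our own (not taken from any AWS material): a Python function that could be deployed on AWS Lambda, for example behind an API gateway. The handler signature is the standard Lambda convention; everything underneath it – servers, scaling, patching – is the platform’s concern, not the developer’s.

```python
import json

# Minimal, hypothetical AWS Lambda handler: the "function" is the unit of
# packaging and deployment, with no server, VM or container to manage.
def lambda_handler(event, context):
    # 'event' carries the trigger payload (e.g. an API Gateway request);
    # 'context' exposes runtime metadata such as remaining execution time.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```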

    Something we really liked at the Summit was the even-handed and eminently pragmatic approach that AWS took to these different options. AWS is a long-time advocate of serverless, and it hasn’t changed its opinion. So of course, Adrian Cockcroft gave it a thorough overview, starting with the company’s pioneering Lambda service launched in April 2015. However, he explicitly recognized that customers want choice and that many have chosen containers. That’s why AWS also offers cloud-integrated container services with both Amazon Elastic Container Service (which includes the easy-to-use AWS Fargate) and a managed Kubernetes offering, Amazon EKS.

There are a variety of reasons for this increasingly pragmatic approach, but we’ll just cite here our own favorite explanation. AWS has begun to penetrate the mainstream market in a big way, winning major contracts with global companies and public administrations. (The recent contract with Volkswagen Manufacturing is a good example.) The CIOs of these organisations are not techy early adopters. They are entirely pragmatic and focused on how technology translates into business outcomes. In our opinion, the AWS organisation has learned a lot by working with and for them.

Three Summit themes for our data mad world

When Country Manager Julien Groues opened the Paris Summit, in addition to talking about revenue growth, new customers and, of course, the increasing presence of AWS offices and data centers in France, he almost casually mentioned three things - Aurora, SageMaker and data lakes – before passing the relay to Adrian Cockcroft for the traditional morning keynote.

In our data mad age, it’s not surprising that all three are basically about managing and exploiting data. These are the three things that we’ll dive into here.

Databases: you want it, we’ve got it!

When the Amazon Aurora relational database became generally available in 2015, we immediately liked it.

What’s not to like? Aurora is a fully managed “built for the cloud” standard SQL database that is compatible with MySQL and PostgreSQL, offering (according to AWS) the performance of commercial rivals at 10%-20% of the total cost. It also has synchronous replication capabilities (to another AZ for example) which can be very useful for applications needing High Availability. Finally, to make life even easier for Operations, Amazon Aurora provides a “serverless” option that scales database capacity automatically based on demand, making it a good fit for new serverless applications and other potentially unpredictable workloads.
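As a purely illustrative sketch of our own (not AWS documentation), provisioning the serverless option comes down to a single API call with boto3; the identifiers, credentials and capacity settings below are placeholders.

```python
import boto3

# Hypothetical example: create an Aurora Serverless (MySQL-compatible) cluster.
rds = boto3.client("rds", region_name="eu-west-3")  # Paris region, for instance

rds.create_db_cluster(
    DBClusterIdentifier="demo-aurora-serverless",   # placeholder name
    Engine="aurora",                                # MySQL-compatible edition
    EngineMode="serverless",                        # capacity scales with demand
    MasterUsername="admin",
    MasterUserPassword="please-change-me",          # use AWS Secrets Manager in real life
    ScalingConfiguration={
        "MinCapacity": 2,                # Aurora Capacity Units (ACUs)
        "MaxCapacity": 8,
        "AutoPause": True,               # pause when idle to stop paying for compute
        "SecondsUntilAutoPause": 300,
    },
)
```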

Interestingly (and wisely) enough, AWS avoids making direct performance comparisons between Oracle and Aurora. Oracle is the Formula 1 of databases: very fast with lots of great functions, but also expensive to license and requiring a team of skilled experts to maintain and run it. Not to mention those annoying Oracle license audits and sometimes maddening renewal negotiations. One can wonder, however, what percentage of ordinary business OLTP applications really needs the levels of performance that Oracle can deliver.

Aurora has now become the fastest growing service in AWS history and it has also won industry recognition. In July 2019, the Amazon Aurora development team won the 2019 ACM SIGMOD Systems Award that recognizes "an individual or set of individuals for the development of a software or hardware system whose technical contributions have had significant impact on the theory or practice of large-scale data management systems."

For AWS customers choosing a relational database for an OLTP application, Aurora will inevitably be on the short list. Some customers however choose otherwise, often to stick with their current system, be it MariaDB, PostgreSQL, MySQL, Microsoft SQL Server or even Oracle. This is not a problem for AWS because all are available as “DB engines” within Amazon RDS, the company’s overall managed Relational Database Service.

Even so, as Jeff Bezos wrote in his shareholder letter: “we’re also optimistic about specialized databases for specialized workloads… the requirements for apps have changed.” That’s why AWS also offers Amazon DynamoDB, a NoSQL database; Amazon DocumentDB (with MongoDB compatibility) for storing collections of documents; Amazon Redshift, a petabyte-scale data warehouse service; the graph-oriented Amazon Neptune; Amazon ElastiCache, an in-memory cache service; time series databases like Amazon Timestream; and ledger solutions like Amazon Quantum Ledger Database.

We could also add to the list a goodly number of non-managed third-party databases certified to run on AWS, including of course SAP HANA, as well as AnzoGraph from Cambridge Semantics for knowledge graph based analytics.

Databases remain the workhorses of most normal business processing. Making Amazon Aurora a big success – while also making other choices available to customers - is a strategic necessity for AWS in its battle for leadership in enterprise IT.

Democratizing Machine Learning

Whatever the hype about all the great things that ML (Machine Learning, including Deep Neural Networks) can or will be able to do in the future, there are already plenty of powerful algorithms available and huge numbers of use cases. Still, the reality is that many organisations are struggling with putting Machine Learning to work for their businesses. Experienced data scientists are in short supply and many developers are unfamiliar with ML.

To put things into context, we’ll use the three-level model sketched out by CEO Andy Jassy at re:Invent 2017 for the announcement of Amazon SageMaker and other ML tools.

• The bottom layer is for expert ML practitioners with complex projects. For them, AWS now supports all of the major frameworks as managed services, including TensorFlow, PyTorch and Keras, not to mention Apache MXNet, Chainer, Gluon and Horovod. Ironically enough, the widely popular TensorFlow was created by Google which put it into open source. Now about 85% of TensorFlow projects in the cloud run on AWS, according to Nucleus Research.

• The top layer is a set of pre-trained AI services allowing ordinary developers simply to “consume” machine learning. Accessible by APIs, these AI services can be easily integrated into applications to address common use cases such as Text to Voice, Recommendations, Image and Video Analysis, Forecasting, Translation and even Conversational Agents for Contact Centers.

While the underlying AI models are the same for all customers (i.e. they are pre-trained), various AI Services do offer possibilities for customization. For example, customers relying on Amazon Transcribe can add a custom vocabulary to recognize terms specific to their business or industry. Other services, like Amazon Forecast, offer more advanced customers the ability to customize the service by tuning parameters.

• The middle layer is for developers who aren’t experts in ML but want to build custom ML models into innovative business applications. Another target group is data scientists who want to spend their time on analysis and insights, but have little appetite (or perhaps even talent) for coding.

SageMaker sits in this middle layer, which is critical for digital transformation and where the “democratization of Machine Learning” is most urgent. According to the company, it provides “the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action.” In other words, SageMaker leverages ML functional libraries in an easy-to-deploy framework that will “enable you to move past the difficult coding and just innovate.” For an AWS customer, it’s an obvious choice and, in less than two years, SageMaker has already attracted over 10,000 customers.
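To give a feel for that workflow, here is a hedged, minimal sketch using the SageMaker Python SDK as it stood in 2019 (v1-style parameter names), training the built-in XGBoost algorithm and deploying the model to a managed endpoint. The S3 bucket and data locations are hypothetical.

```python
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = get_execution_role()   # works inside a SageMaker notebook; elsewhere, pass an IAM role ARN

# Built-in XGBoost algorithm container for the current region
container = get_image_uri(session.boto_region_name, "xgboost", repo_version="0.90-1")

estimator = sagemaker.estimator.Estimator(
    image_name=container,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output",          # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# SageMaker provisions the training infrastructure, runs the job and tears it down
estimator.fit({"train": "s3://my-ml-bucket/train/"})  # placeholder training data location

# One call to obtain a managed HTTPS inference endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.t2.medium")
```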

Even so, what about the competition? Both Google and Microsoft offer (on their own clouds of course) tools with broadly similar capabilities. In addition, Google has Cloud AutoML that works at a higher level of abstraction and enables customers to create customized deep learning models without any knowledge of data science or programming.

With AutoML, Google Cloud Platform (GCP) seems to have taken some lead in the democratization of ML, but will that last? We asked our good friend, the distinguished and influential data scientist Carla Gentry, for her take. According to Carla: “Amazon may have a hard time keeping up with AutoML at that higher level of abstraction, but I can't imagine AWS letting that stand... so time will tell.”

In our view, even if GCP does manage to maintain a (time to market) lead over AWS in some Machine Learning services, it doesn’t matter very much. AWS has immensely more customers than GCP. Enterprises always take longer to absorb new technologies than vendors imagine, and there will be ample time to close any gap – with this or that ML service - if AWS customers want it.

Data lakes…not data swamps!

Background

Pentaho CTO James Dixon is credited with coining the term "data lake" in his blog back in 2010. He introduced it writing: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

At that time, users were increasingly dissatisfied with data marts and data warehouses, which:
• Require lengthy IT preparation between data creation and availability for analysis
• Are limited to structured data, to the exclusion of unstructured content (documents…)
• Are only usable via SQL queries, which are of little help in “digging down” into issues.

While Dixon did provide a carefully worded explanation, data lakes were quickly misunderstood by the market as all-encompassing, “single source of truth” repositories for all structured and unstructured enterprise data, operating on the principle “throw in everything, make sense of it later”. Enthusiasm ran high and many data lake projects took off, typically on premises (sometimes in the cloud) and usually based on inexpensive but difficult-to-use Hadoop storage.

Within several years, however, alarming reports of project failures began coming in, for a variety of reasons: unclear objectives, poor governance, no “quick wins”, numerous semantic inconsistencies and lack of data integration, not to mention woefully inadequate tooling. By 2016, industry analysts were talking about data lake failure rates exceeding 65%. Many projects didn’t even make it past POC or pilot, and large data lakes that did make it into production often became so cluttered with voluminous, inconsistent data that even experienced data scientists struggled.

So what’s the market sentiment in 2019?

In a recent article in DATAVERSITY, provocatively titled “Is it time to drain the data lake?”, the expert Karthik Ramasamy wrote: “It’s no wonder that interest in data lakes rose rapidly at the same time hype around “Big Data” was exploding: a data lake seemed like the natural place to store big data, especially when you don’t know exactly what you will be doing with it. But this approach has fallen drastically short of expectations…”

Feedback from conferences is also an indicator of market sentiment. According to the Analytics8 blog on the 2019 Gartner Data and Analytics Summit: “Data lakes were red hot a few years ago (but) subsequently fell out of fashion…mostly due to the simple fact that they became non-performant and unmanageable after becoming an enterprise dumping ground of data with no vision.”

The report continued: “Data lake conversations, still a bit radioactive at times, were usually preceded with disclaimers and stories of lessons learned… (Still) we concur with the ‘esprit de corps’ that data lakes will be employed for the foreseeable future and have a better shot at success this time around.”

Overall we’d have to say that - despite the hopeful ‘esprit de corps’ of big data people who bet big on data lakes - current market sentiment can best be described as: confused.

AWS Lake Formation

Given the experience of the last few years, the company’s 2019 push into data lakes does seem a bit incongruous - despite the clear commercial interest in getting users to store all their data on AWS.

AWS hosts a very large number of data lakes and, from its “front row seat”, has learned a lot about the complexities and difficulties that customers confront in building and managing these projects. In the keynote, Adrian Cockcroft was admirably upfront about the risk of complex data lakes turning into “data swamps”. Since efficient tooling is the obvious domain where AWS can help, he also emphasized the manual, complicated and time-consuming tasks usually needed to set up and manage data lakes. The answer to that problem was AWS Lake Formation, originally announced at re:Invent 2018 and made generally available on August 8, 2019.

According to the company, “AWS Lake Formation (is) a fully managed service that makes it much easier for customers to build, secure, and manage data lakes….” Less expensive too, at least for storing massive data volumes, because they are built directly on inexpensive S3 object storage.

Again, according to AWS: “Customers can easily bring their data into a data lake from a variety of sources using pre-defined templates, automatically classify and prepare the data, and centrally define granular data access policies to govern access by the different groups within an organization. Customers can then analyze this data using their choice of AWS analytics and machine learning services.”

While we haven’t had the occasion to look at AWS Lake Formation in any detail, the new service (as described by the company) does represent considerable progress over the tooling previously available for these customer projects. The underlying approach is similar to SageMaker ML: automate the repetitive low value tasks, so that customers can just get on with business innovation.
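As a rough illustration of that approach (our own sketch built on the boto3 Lake Formation client, not on anything demonstrated at the Summit), registering storage and granting centrally managed permissions comes down to a couple of API calls; all ARNs and names below are placeholders.

```python
import boto3

lf = boto3.client("lakeformation", region_name="eu-west-3")

# 1. Register an S3 location as managed data lake storage
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",   # placeholder bucket
    UseServiceLinkedRole=True,
)

# 2. Grant table-level access to an analyst role, defined in one central place
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analysts"},
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    Permissions=["SELECT"],
)
```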

In that context, we don’t have any real problem with Lake Formation as such, nor with the broadly similar offerings from other big players for their customers. Our real issue is more fundamental.

Is “data lake” a failed concept?

On this question, naturally we went back to expert Data Scientist Carla Gentry for her view:
“Without the ability to monitor in real-time, and considering the high chance of getting lost in the clutter, I'd say the data lake’s time has past. Patterns can be found in many ways and data lakes are definitely not the best choice considering the alternatives...”

On our side we have three fundamental issues with enterprise data lakes:

• Complete centralization

By definition, data lakes are centralized repositories of all enterprise data. Whatever the advantages of a totally centralized approach to managing big data, the bigger picture is that IT is becoming much more distributed.

The move to distributed IT, despite years of pundit predictions of the inevitable future dominance by just a few hyperscale public clouds, is driven by multiple factors:
o The emergence of Edge Computing
o The stubborn preference of some enterprises for private clouds
o The availability (at long last) of industrialized hybrid clouds
o The seemingly eternal life of legacy applications best suited for operation in enterprise data centers.

We could also add that if the enterprise data lake is centralized on one public cloud, this is a real problem for a user with a multi-cloud strategy.

In addition, a totally centralized approach is not well adapted to innovative uses in behavioral and predictive analytics, together with Artificial Intelligence. The new model is distributed - with some work being done in the cloud (for example, storing and processing massive data sets for training ML models), some at the Edge, and some on premises, whether for real-time performance or because compliance requires the data to stay there.

• All encompassing usage

By construction, the data lake is intended to cover any and all use cases, many not yet defined. In actual practice, however, users want to have clearly defined business objectives for an analytics project, considerably more specific than “we’ll find value in your data”.

As Karthik Ramasamy wrote: “Because data lakes are best suited for scenarios where you don’t know what you want to do with your data (i.e., you’d store it until you did), they have not worked optimally in cases where you have a clear idea of what you want. This has required teams to create a separate pipeline to bypass the data lake.”

Data pipeline architectures are an interesting and, above all, more focused alternative to the all-encompassing enterprise data lake. Of course, they should not be built from scratch - there are numerous independent software companies in this market space. AWS partner SnapLogic is a well-known player in data pipelines, but there are others to consider. An important advantage for many customers is that most of these offerings are multi-cloud.

• Lack of semantics and data interconnection

This is our biggest issue with traditional “throw everything in, make sense of it later” enterprise data lakes, whether on premises or in the cloud. It merits deep discussion but for this article we will be brief.

Data lakes accumulate large volumes of disparate data types from various sources without context or readily discernible meaning. The meanings, in other words the semantics, are not stored with the data. At best there is a separate, informal data model, but often the meanings are (for all practical purposes) only in the minds of the users and developers. Worse still, the meanings are often inconsistent. The jumbled data in data lakes lack semantic and metadata consistency, creating further ambiguity about the data’s meaning, purpose and relation to other data. As a simple example, in a global company a data entity like “client” can have dozens of different meanings with very different attributes.

In addition, there is no mechanism for establishing persistent direct connections among closely related data. This is a huge problem for extracting insights from data, because the real intelligence is in the relationships. As Kirk Borne, Principal Data Scientist at Booz Allen Hamilton, says “The natural data structure of the world is a graph … Knowledge is about connecting the dots.”

Fortunately new technologies are emerging that respond to these needs, in particular data fabric and knowledge graph analytics. Very briefly stated, data fabric solutions provide a semantic layer derived from “ontologies” - which formally document the meanings of the data in line with semantic standards – and are deployed over existing enterprise data investments wherever they may be. They typically leverage powerful graph technologies to connect, harmonize and exploit the interconnected data. It’s worth noting that this approach protects previous customer investments, because the data fabric can be layered over existing data lakes wherever located, on premises or in a public cloud.
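As a toy illustration of why stored relationships matter (our own sketch, using the open source rdflib library rather than any particular vendor’s product), two “client” records from different systems can be declared to denote the same real-world entity and then queried as connected data.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.com/ontology#")   # hypothetical ontology namespace
g = Graph()

# The same real-world customer, known under different identifiers in two systems
crm_client = EX["crm/client/42"]
erp_client = EX["erp/account/A-1001"]

g.add((crm_client, RDF.type, EX.Client))
g.add((erp_client, RDF.type, EX.Client))
g.add((crm_client, RDFS.label, Literal("ACME Corp (CRM)")))
g.add((erp_client, RDFS.label, Literal("ACME Corporation (ERP)")))

# The semantic layer records explicitly that both identifiers mean the same entity
g.add((crm_client, OWL.sameAs, erp_client))

# The connection itself is now data that can be traversed with a query
results = g.query("""
    SELECT ?a ?b WHERE { ?a <http://www.w3.org/2002/07/owl#sameAs> ?b . }
""")
for a, b in results:
    print(a, "refers to the same client as", b)
```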

AWS partner Cambridge Semantics (a company we know and appreciate) is a well-known player in data fabric and graph analytics, but there are of course others to consider in this still emerging market space.

Concluding remarks on AWS Lake Formation

To conclude on this theme, we would like to stress that our analysis is not a particularly negative take on Lake Formation itself. As described by the company, it does seem to represent considerable progress over the tooling previously available for customer data lake projects.

For users who still want to build enterprise data lakes, we would see the best fit with customers going “all in” with AWS (little or no need for multi-cloud data engineering) and whose data involve limited semantic complexities.

Customers with more complex needs would do well to consider alternatives to data lakes, especially things like data fabric and graph analytics. Fortunately a good many of the companies who offer these more modern and increasingly popular solutions are AWS partners!

Wrap-up

Amazon Web Services seems well positioned to retain its rank as the world’s cloud computing leader for the foreseeable future. We see the company’s capacity for sustainable innovation and its increasingly pragmatic approach to the market as very positive factors. While it does have several rapidly growing competitors, competition is good for customers, for the cloud market and for AWS itself.

As noted earlier, this article makes no claim to be exhaustive: we went into detail only on three important data-oriented themes - databases, Machine Learning and data lakes - where we have strong opinions.

The only omission we regret is the IoT Edge. We’ll come back to that in a later piece, because we see the Edge as a difficult and very different challenge for AWS … and indeed for all the players in the cloud market.



Tuesday, October 1st 2019
Donald Callahan