Splunk, a platform for searching, monitoring, and examining machine-generated big data, has launched a new release of its application monitoring tool, SignalFx Microservices APM™. The new release combines NoSample™ tracing, open standards-based instrumentation and artificial intelligence (AI)-driven directed troubleshooting from SignalFx and Omnition into a single solution. SignalFx Microservices APM supports lightweight, open source and open standards-based instrumentation with the goal of flexible data collection designed for modern cloud environments.
Splunk also further expanded its observability offerings with a major feature release in SignalFx Infrastructure Monitoring for containerised data: Kubernetes Navigator. Kubernetes Navigator uses AI-driven analytics to surface recommendations intended to expedite triaging and troubleshooting. Workflow integration between Kubernetes Navigator and Splunk Enterprise or Splunk Cloud aims to reduce context switching and provide insights with the goal of accelerated root-cause analysis.
InfoQ asked Karthik Rau, area general manager for application management at Splunk, to answer some questions relating to the new release:
InfoQ: How do teams use Splunk solutions, including SignalFx, to obtain a complete picture in hybrid environments that include legacy, heritage or cherished applications and platforms, such as SAP or mainframe alongside public/private cloud and microservices-based products?
Rau: When we look at the market, we can easily identify a trend to move workloads from private to hybrid to public clouds: a journey to become cloud-native. We realised that there was a gap in observability for microservices-based applications, for which traditional methods of application monitoring don’t work because applications are no longer constructed as monoliths. Because cloud infrastructure is ephemeral, hundreds and sometimes thousands of microservices have complex interdependencies, and DevOps teams release code multiple times per day, problems occur much more frequently and are much harder to troubleshoot and resolve. This new complexity frequently results in customer-impacting service outages, slowdowns and errors. To solve these problems, we took a different approach with SignalFx Microservices APM: collecting and analysing 100% of the data. This is beneficial for IT and DevOps since it means that no issue goes undetected. Once the data is collected, SignalFx uses a combination of AI and ML to connect the dots and drive relevant information to the surface, with the goal of allowing developers to spend less time searching for the source of problems and more time resolving them.
InfoQ: How can Splunk help teams gain visibility into the 4 key DevOps metrics i.e. deployment frequency, lead time (code commit to deploy in production), MTTR and change fail rate?
Rau: There are two important aspects to this: first, the application delivery pipeline and second, as part of that lifecycle, production monitoring. Splunk Enterprise and Splunk Cloud already provide application lifecycle analytics, which give visibility into the end-to-end development process, connecting tools across the entire development toolchain and providing visibility into code quality and DevOps metrics. With the addition of SignalFx Microservices APM, we now provide DevOps teams with the industry’s most powerful production monitoring and troubleshooting solution for any on-premises, hybrid, or cloud application. One of the unique capabilities of SignalFx Microservices APM is the ability to collect 100% of traces, meaning that DevOps teams can, with full fidelity and extremely high levels of granularity, understand the exact behaviour of their software and accelerate deployment frequency. Combined with our streaming analytics engine, our customers can see the impact of such releases in seconds, thereby minimising Mean Time to Detect (MTTD), and act immediately. With our unique AI-Driven Directed Troubleshooting, which combs through all the trace data and automatically surfaces recommendations, DevOps teams can quickly pinpoint and resolve the root cause of an issue, significantly reducing MTTR and helping developers. Finally, we also have the ability to automate responses via our monitoring-as-code approach. We can enable DevOps teams to deploy multiple versions of code or canary releases, track the impact of each and every release, and roll back if there’s a problem, with the intention of reducing change failure rate and fixing problems before they impact end users.
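For readers unfamiliar with the four metrics named in the question, the following is a minimal Python sketch of how they might be computed from deployment records. It is not Splunk's implementation: the `Deployment` record, its field names and the `devops_metrics` function are hypothetical, and MTTR is approximated here as the time from a failed deployment to service restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did the change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def devops_metrics(deployments: List[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a reporting window of `window_days` days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    restore_times = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": total / window_days,
        "mean_lead_time_hours": mean(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / total if total else None,
        "mttr_hours": mean(restore_times) if restore_times else None,
    }


# Illustrative data only: four deployments over a two-day window.
sample = [
    Deployment(datetime(2020, 1, 1, 8), datetime(2020, 1, 1, 10)),
    Deployment(datetime(2020, 1, 1, 12), datetime(2020, 1, 1, 16),
               failed=True, restored_at=datetime(2020, 1, 1, 17)),
    Deployment(datetime(2020, 1, 2, 9), datetime(2020, 1, 2, 12)),
    Deployment(datetime(2020, 1, 2, 13), datetime(2020, 1, 2, 16),
               failed=True, restored_at=datetime(2020, 1, 2, 19)),
]
metrics = devops_metrics(sample, window_days=2)
```

In practice the commit and deployment timestamps would come from the delivery pipeline and the failure and restoration timestamps from production monitoring or incident records, which is why both aspects mentioned in the answer are needed.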
InfoQ: How does Splunk help teams manage flow in a value stream?
Rau: Splunk helps manage the end-to-end application (DevOps) lifecycle by monitoring the delivery pipeline and production environment, as mentioned above. In addition, with our incident response and automation capabilities, we provide open- and closed-loop capabilities for the supporting practices, especially for incident management, service level management and knowledge management. Production application monitoring is often the weakest link in the value stream, with legacy APM solutions providing limited visibility into what is actually happening with applications and end user experiences with those applications. With SignalFx Microservices APM, DevOps teams are able to correlate, understand, and quickly act on mountains of trace data to deeply understand the behaviour of their applications, instantly detect problems, and quickly resolve issues before users are affected. The level of observability that Splunk now offers means that developers spend less time troubleshooting and more time coding.
InfoQ: Can Splunk help teams calculate the value realised from a new feature and if so, how?
Rau: With SignalFx, we support custom business metrics that tie directly back to the production application so DevOps teams and business stakeholders can see how code changes can positively (or negatively) impact application uptime and user experience, and correlate that to, for example in an e-commerce application, units of goods sold, in real time. This ability to track relevant business data, correlate it to application performance, and do so in real time is increasingly important for any digital initiative, especially those built around always-on online experiences. Splunk recently released a survey in conjunction with ESG that quantified the economic impact of leveraging data. The survey found that on average, companies reported a bottom-line improvement over the past 12 months of $27.6M (or a 9.1% gain in net income) directly attributable to operationalising data.
InfoQ: What are the observability challenges that microservices architectures cause and how does Splunk solve them?
Rau: Microservices have a lot of advantages in terms of scaling and time to market, among others, but they also introduce their own challenges and high degrees of complexity. The infrastructure on which they run is typically ephemeral, spinning up and spinning down very quickly. Services and individual instances of services scale fast and, as their numbers multiply, the interactions between them multiply even faster, causing the amount of data to skyrocket and creating very complex interdependencies. You often have multiple versions of the same microservice running at the same time, and these versions are sometimes released several times a day. Finally, DevOps teams try to find the optimal tools and frameworks for each microservice, and as a result rely heavily on open source and open standards. In such environments, traditional APM tools miss issues because their approach to handling large amounts of data is based on sampling and manual needle-in-a-haystack troubleshooting; they are also slow, siloed, and lock customers in with proprietary agents. On the other hand, SignalFx Microservices APM was designed specifically for microservices. We solve the challenges they introduce by ingesting and analysing all the data, using advanced AI and streaming analytics to get insights within seconds, as well as leveraging and contributing to open standards such as OpenTelemetry, which we co-founded.
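The claim that sampling can miss issues can be illustrated with some simple arithmetic. The sketch below assumes each trace is kept independently with a fixed probability, which is a simplification of real sampling strategies; the function name and the figures are illustrative and do not describe any particular APM product.

```python
def probability_all_missed(sample_rate: float, affected_traces: int) -> float:
    """Chance that none of the affected traces is kept.

    Assumes every trace is sampled independently with probability
    `sample_rate` (a simplifying assumption for illustration).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return (1.0 - sample_rate) ** affected_traces


# A rare error touches 50 traces; compare 1% sampling with keeping everything.
missed_at_one_percent = probability_all_missed(0.01, 50)
missed_at_full_capture = probability_all_missed(1.0, 50)

print(f"1% sampling:   {missed_at_one_percent:.3f} chance the error is never seen")
print(f"100% capture:  {missed_at_full_capture:.3f} chance the error is never seen")
```

Under these assumptions, an error affecting 50 traces has roughly a 60% chance of leaving no sampled trace at all when only 1% of traces are kept, whereas keeping every trace leaves no such gap.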
InfoQ: What are some examples of insights that Kubernetes Navigator provides?
Rau: Kubernetes Navigator provides visibility into Kubernetes environments of all sizes. With Kubernetes Navigator, DevOps teams are able to detect, triage and resolve performance issues by navigating the complexity associated with operating Kubernetes at scale. Kubernetes Navigator helps DevOps teams expedite troubleshooting and provides them with ways to instantly understand the health of Kubernetes clusters. To understand the ‘why’ behind performance anomalies, Kubernetes Navigator uses AI-driven analytics, which automatically surface insights and recommendations to answer what is causing anomalies across the entire Kubernetes cluster: nodes, pods, containers, and workloads. One such example is a noisy neighbour problem. Application workloads run on containers that are dynamically managed by Kubernetes across shared infrastructure resources. A noisy neighbour, which could be caused by a simple misconfiguration of a memory limit, could increase the memory consumption on a particular node, impacting the rest of the containers, and application workloads, on that node. This might result in end users experiencing slow performance or errors as they interact with the application. Without Kubernetes Navigator, DevOps teams would spend significant time examining individual nodes, pods or workloads. Kubernetes Navigator makes suggestions on which specific pod or workload might be causing the anomalies, with the goal of reducing triaging and troubleshooting time. A unified, correlated view across services and infrastructure can enable DevOps teams to swiftly identify which specific instance of a service is being impacted.
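To make the noisy neighbour example concrete, here is a deliberately simple Python heuristic that flags pods consuming a large share of a node's memory and notes whether a memory limit is configured. It is only a sketch: it is not how Kubernetes Navigator works, and the `PodMemory` record, the threshold and the sample figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PodMemory:
    """Hypothetical snapshot of one pod's memory on a node."""
    name: str
    usage_mib: float
    limit_mib: Optional[float]  # None means no memory limit is configured


def noisy_neighbour_candidates(
    pods: List[PodMemory],
    node_capacity_mib: float,
    share_threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Return (pod name, share of node memory, has_limit) for suspect pods.

    A pod is a suspect when it uses at least `share_threshold` of the
    node's memory. Results are sorted with the largest consumer first.
    """
    suspects = []
    for pod in pods:
        share = pod.usage_mib / node_capacity_mib
        if share >= share_threshold:
            suspects.append((pod.name, share, pod.limit_mib is not None))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Illustrative data only: three pods on a node with 8192 MiB of memory.
node_pods = [
    PodMemory("checkout", usage_mib=512, limit_mib=1024),
    PodMemory("cache", usage_mib=5120, limit_mib=None),
    PodMemory("search", usage_mib=768, limit_mib=1024),
]
suspects = noisy_neighbour_candidates(node_pods, node_capacity_mib=8192)
```

With this sample data, only the `cache` pod is flagged, and the missing limit points at the kind of misconfiguration described in the answer. A real system would look at trends over time and at many more signals than a single snapshot.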
InfoQ: Where is the line between infrastructure and application in a product-centric, cloud and microservices world?
Rau: In order to survive and thrive in today’s increasingly product-centric world, an equal focus must be put on infrastructure and application. End user interactions are at the core of every business today, and their experiences are fragile. End users that have to wait too long for an application to load do not care whether the root cause is in the infrastructure or in the application. That’s why having a unified, full stack view of both your applications and your infrastructure, and being able to correlate the two is extremely important, and can have a direct impact on revenue, and ultimately, overall brand loyalty. Another consideration is the evolution of cloud infrastructure in the sense that it is becoming much more software-defined and ephemeral. Developers no longer need to rely on IT teams to rack and stack servers in a data centre. They can simply go to any cloud provider and, with a few simple clicks of the mouse button, provision any amount of infrastructure resources they need in a matter of minutes. They can also use serverless functions, which abstract away infrastructure altogether. This evolution of infrastructure has been critical to accelerating innovation and the delivery of software.
InfoQ: How does Splunk integrate with ChatOps and service desk or incident management solutions such as ServiceNow, Jira Service Desk or Cherwell?
Rau: Splunk’s VictorOps incident response system integrates deeply with service desks like ServiceNow as well as chat-oriented tools like Slack. Incident tickets in ServiceNow are correlated with incidents in VictorOps, and all updates and closures of tickets are synchronised between ServiceNow and VictorOps. Similarly, VictorOps integrates with Slack. When an incident is opened, a Slack channel is opened, and chat that occurs in that channel is synchronised between Slack and VictorOps. You can even use Slack commands to escalate, snooze and close events. Combined, VictorOps can synchronise across ServiceNow and Slack so operations teams and developers can chat in their preferred tool, but VictorOps is logging everything. Teams can also curate interactions between people for post incident review reporting.
InfoQ: What does being a gold member of Cloud Native Computing Foundation (CNCF) mean?
Rau: We became a gold member to demonstrate our commitment to open source and deepen our relationship with the DevOps community. While Splunk has been actively involved in open source for many years with offerings and contributions to numerous projects, this commitment has accelerated with the acquisitions of SignalFx, Omnition (a founding contributor to the OpenTelemetry project), and others. Our own CNCF contributions have included projects like Cortex, Prometheus, Envoy, Fluentd and others, both as maintainers and contributors. More recently, our team has focused on bringing the OpenTelemetry project to fruition to provide developers with the most flexibility in collecting data from their applications while avoiding proprietary, heavyweight and performance-impacting agents.
To learn more about the CNCF’s projects, review the CNCF Cloud Native Interactive Landscape.