Observing Kubernetes Services With Veneur

Veneur is a distributed, fault-tolerant observability pipeline

What is Veneur?

Veneur is a data pipeline for observing distributed systems. You can use Veneur for aggregating application and system data, like runtime metrics or distributed traces, and intelligently routing the data to various tools for storage and analysis. Veneur supports a number of these tools – called “data sinks” – such as SignalFX, Datadog, or Kafka. For this walkthrough, we’ll use Datadog.

Collecting pod-local metrics with Veneur

The first step to observing our services with Veneur is to deploy a local sidecar instance of Veneur inside our pod. To keep things simple, we'll create a minimal application whose only job is to emit a heartbeat metric. We'll do this using veneur-emit.

Veneur-emit is a convenience utility we wrote for emitting statsd-compatible metrics at the command line, for testing. It's equivalent to using a statsd client library within an application. We wrote it to synthesize data transmission, and it's used in much the same way as netcat. So the following netcat command:

$ echo -n "resp_success:1|c|#method:post" | nc -4u localhost 8126

could be written as

$ veneur-emit -hostport udp://localhost:8126 -count 1 -name "resp_success" -tag "method:post"

Basic Example

In this example, veneur-emit is a stand-in for any application you want to observe, as long as it's instrumented using statsd client libraries. Since it's the main application – and first container – in our pod, we'll begin with the vanilla setup:
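As a sketch, that first container might look something like the following – the image tag, the heartbeat loop, and the metric name here are illustrative assumptions rather than an exact manifest:

```yaml
containers:
  # The "application": veneur-emit firing a heartbeat metric over
  # UDP to port 8126 every ten seconds. (Sketch only: the image tag
  # and the shell loop are assumptions.)
  - name: veneur-emit
    image: index.docker.io/stripe/veneur:3.0.0
    command: ["/bin/sh", "-c"]
    args:
      - |
        while true; do
          veneur-emit -hostport udp://localhost:8126 -count 1 -name "heartbeat"
          sleep 10
        done
```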

That’s the first container in our pod, and if we deploy it as-is, it’ll start faithfully firing off UDP metrics on port 8126. But there’s nothing listening on that port, so it’s just talking into the void. Let’s give it an audience by adding a second container in the same pod:

- name: veneur
  image: index.docker.io/stripe/veneur:3.0.0

Veneur requires almost no configuration out of the box; it defaults to working values. This new container is almost ready to start collecting metrics. However, without a Datadog API key, it won’t be able to send the metrics anywhere, so at the very least, we’ll need to configure our downstream data sinks.


Configuring a Pod-Local Veneur Collector

When running in non-containerized environments, Veneur can read its configuration from a file. But in Kubernetes, environment variables are more convenient. Thanks to envconfig, we can use these two forms of configuration interchangeably. Every config option specified in the yaml config example has a corresponding environment variable, prefixed with VENEUR_. So, we provide the following:

env:
  - name: VENEUR_DATADOGAPIKEY
    valueFrom:
      secretKeyRef:
        name: datadog
        key: datadog_api_key
For more information on setting the secret key values in Kubernetes, see the Kubernetes documentation on secrets.

We also listen for UDP packets on port 8126 to receive the metrics, and for HTTP traffic on port 8127. Port 8127 serves the healthcheck endpoint. It can also be used to listen for metrics emitted over TCP using authenticated TLS instead of UDP, but we won't cover that here, as it's not needed for most Kubernetes deployments.
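In the Veneur container's spec, those two listeners correspond to something like this (the port names are illustrative):

```yaml
ports:
  - name: statsd-udp      # metric ingestion from the application
    containerPort: 8126
    protocol: UDP
  - name: http            # healthcheck (and optional TLS ingestion)
    containerPort: 8127
    protocol: TCP
```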

So, putting it all together, we have:
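A sketch of the whole veneur-emit.yaml deployment, assembled from the pieces above – the heartbeat loop and the VENEUR_DATADOGAPIKEY variable name are assumptions to check against your Veneur version:

```yaml
# veneur-emit.yaml: a heartbeat app plus a pod-local Veneur sidecar.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veneur-emit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: veneur-emit
  template:
    metadata:
      labels:
        app: veneur-emit
    spec:
      containers:
        # The application: one heartbeat metric every ten seconds,
        # i.e. six per minute.
        - name: veneur-emit
          image: index.docker.io/stripe/veneur:3.0.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                veneur-emit -hostport udp://localhost:8126 -count 1 -name "heartbeat"
                sleep 10
              done
        # The pod-local Veneur sidecar, listening on the pod's
        # localhost and shipping metrics to Datadog.
        - name: veneur
          image: index.docker.io/stripe/veneur:3.0.0
          env:
            - name: VENEUR_DATADOGAPIKEY
              valueFrom:
                secretKeyRef:
                  name: datadog
                  key: datadog_api_key
          ports:
            - containerPort: 8126
              protocol: UDP
            - containerPort: 8127
              protocol: TCP
```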

If we apply this deployment (`kubectl apply -f veneur-emit.yaml`), we’ll start to see these metrics come in, at the rate of six per minute. We’re sending metrics all the way from our application to Datadog!

If all you care about is collecting pod-local metrics, that’s it! You can stop here. But, there’s a good chance you want global metric aggregation as well. For that, we’ll have to add a little more.


Global Aggregation – How it Works

Veneur supports global metric aggregation.

Let’s say you have an API server with 200 different replicas responding to stateless requests, load-balanced evenly across all instances. If you want to reduce your API latency, you probably want to know the 95th or 99th percentile for API latency across all 200 replicas, rather than the 99th percentile latency for the first pod, the 99th percentile for the second, and so forth. Having the separate latencies for the individual pods doesn’t help you, because percentiles aren’t monoids: there’s no mathematical operation to construct the global 99th percentile from the 200 different measurements of the 99th percentile on each pod.

Some time-series databases sidestep this problem by storing the raw data and performing the percentile aggregation at read-time, but this requires tremendous network, disk, and compute resources at scale, especially if we want good performance from our queries at read-time. Performing write-time aggregation reduces this load. But calculating percentiles before writing the results requires either sending all metrics to a single aggregator pod (which becomes a single-point-of-failure) or calculating them on a per-pod basis. Fortunately, Veneur has a nifty approach that gives the best of both worlds: global aggregate calculations at write-time, with no single-point-of-failure, and with an architecture that scales horizontally using existing Kubernetes primitives.

Veneur uses t-digests to compute extremely accurate percentile estimates in an online manner, using constant memory. While these percentile calculations are technically approximations, in practice, the error properties yield exact results for tail ends of the distribution, like the 99th percentile. Conveniently, that’s generally what we care about when observing our services, especially latencies.

Veneur provides a horizontally scalable mechanism for calculating global aggregates of timers, histograms, and sets. The pod-local metrics – counters and gauges – are shipped off to Datadog immediately, but timers, histograms, and sets are forwarded to a separate service for global aggregation. The aggregation happens in two layers.


Veneur Proxy and Veneur Global

First, the pod-local veneur processes forward all timers, histograms, and sets to the veneur-proxy service, which is exposed by Kubernetes. Because Veneur is stateless, the choice of pod and node is arbitrary for the veneur-proxy service, and this can be handled by the built-in Kubernetes support for service discovery and load-balancing.

Under-the-hood, the veneur-proxy processes have to do a bit of coordination to split up their metrics properly. Assume that we have three proxy pods and two global pods, and that each proxy pod receives three metrics – named “red”, “blue”, and “pink” for convenience – from an arbitrary number of application pods. We can’t just lean on regular load-balancing within Kubernetes – we need to ensure that all metrics of the same name end up on the same global pod, or else we won’t have aggregated them properly.

Fortunately, this is all handled automatically by the veneur-proxy and veneur-global services. The veneur-proxy pods distribute the metrics consistently by name: all red metrics end up on the same global pod, all blue metrics end up on the same global pod, and all pink metrics end up on the same global pod.

Of course, the number of replicas of the veneur-proxy and veneur-global services is arbitrary (and each veneur-global pod is capable of aggregating far more than one metric at a time!). Since both services are stateless, they can be horizontally scaled and adjusted dynamically to adapt to your particular metric load.

To create the veneur-proxy and veneur-global services, we just need to provide the following definitions.

For the proxy:
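As a sketch, the proxy's Service and Deployment might look like this – the exact forwarding and discovery settings vary by Veneur version, so treat the container configuration as a placeholder to check against the Veneur docs:

```yaml
# Service: gives the pod-local Veneur instances a stable name
# (veneur-proxy) to forward to; Kubernetes load-balances across
# the proxy replicas.
apiVersion: v1
kind: Service
metadata:
  name: veneur-proxy
spec:
  selector:
    app: veneur-proxy
  ports:
    - port: 8127
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veneur-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: veneur-proxy
  template:
    metadata:
      labels:
        app: veneur-proxy
    spec:
      containers:
        # The proxy binary (assumed to ship in the same image).
        - name: veneur-proxy
          image: index.docker.io/stripe/veneur:3.0.0
          command: ["veneur-proxy"]
        # A pod-local Veneur sidecar, so the proxy can emit
        # metrics about itself.
        - name: veneur
          image: index.docker.io/stripe/veneur:3.0.0
```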

The second section is the same as what we saw before – veneur-proxy is a separate application binary, and for it to emit metrics about itself, we’ll want it to have its own, pod-local Veneur instance to talk to.

The global service configuration is even easier:
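A sketch of what that might look like – again, the image tag and environment variable name are assumptions to verify against your Veneur version:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: veneur-global
spec:
  selector:
    app: veneur-global
  ports:
    - port: 8127
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veneur-global
spec:
  replicas: 2
  selector:
    matchLabels:
      app: veneur-global
  template:
    metadata:
      labels:
        app: veneur-global
    spec:
      containers:
        # Global aggregator: receives forwarded timers, histograms,
        # and sets from the proxies and flushes them downstream.
        - name: veneur-global
          image: index.docker.io/stripe/veneur:3.0.0
          env:
            - name: VENEUR_DATADOGAPIKEY
              valueFrom:
                secretKeyRef:
                  name: datadog
                  key: datadog_api_key
```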

Resource Limits

Although Veneur’s functionality extends beyond just collecting runtime metrics, its resource usage is pretty lightweight compared to other dedicated metric collectors. The portion that’s the most expensive computationally – global metric aggregation – is performed on dedicated pods (veneur-global).

Kubernetes has built-in support for resource limits, leveraging the power of cgroups in a convenient abstraction. We can use resource limits to cap the CPU and memory usage of the pod-local Veneur instances.

- name: veneur
  image: index.docker.io/stripe/veneur:3.0.0
  resources:
    limits:
      cpu: 25m


With this three-tier setup, we maintain a number of useful invariants about our observability system.

  • Stateless: Each Veneur process is entirely stateless. No data remains in memory for longer than ten seconds, so failure or eviction of a single pod running Veneur has very limited impact on our system’s overall observability.
  • Distributed: There is no single point of failure. If any pod-local veneur process is killed and restarted, we only lose data for that pod, and only for one flush cycle. If any proxy or global veneur process is killed and restarted, we only lose ten seconds of data, and only for 1/n metrics (where n is the number of replicas we’re running).
  • Horizontally scalable: We can run as many instances of the proxy and global boxes as we need to, to support arbitrarily large loads.
  • Fault-tolerant: If a global veneur instance becomes unreachable, all proxy boxes will update their hashing scheme immediately for the next flush cycle.


Veneur also supports pull-based metric aggregation with Prometheus. veneur-prometheus is a separate binary which polls a Prometheus metrics endpoint to ingest metrics into the Veneur pipeline.

…and more!

We’re just scratching the surface of how Veneur can help you get better observability into your Kubernetes systems. Observability is not just about runtime metrics – it’s about request traces and logs as well. Veneur is an embodiment of the philosophy that logs, runtime metrics, and traces are all deeply connected, and that there should be an intelligent metrics pipeline handling all three.

If any of this excites or intrigues you – or if you have more thoughts on what kinds of visibility you need in your Kubernetes deployments – you’re in luck! Veneur is actively developed, with tagged major releases every six weeks. Drop us a line over on the issue tracker!


Thanks to Jess Frazelle, Kris Nova, and Cory Watson for reading drafts of this post!

Remote Work: 2 Years In

I’ve been working for Stripe for over two years now. The company is based in San Francisco, and I’ve been working remotely from New York City the entire time. When I was deciding whether to accept my offer, I spoke with a number of people to get a feel for the company, the work I’d be doing, and the experience of working remotely.

One of the people I talked to was Julia Evans, who had conveniently written two great blog posts about her experiences working remotely, both three months in and eight months in. With remote work becoming more popular at companies, I’d like to share my experiences of remote work after two years.

I’ve experienced a wide range of remote working configurations: For the first year, I was working at a coworking space with a few other people from the same company. For logistical reasons, we ended up moving out of the space that we worked in together, so since then, I’ve been in a different coworking space by myself. During all of that, I’ve also spent some days here and there working from other places: my apartment, working from a house where I’m visiting friends or family, and working out of our international offices.

Based on that, and on talking to friends who work remotely at other companies, here are some things I’ve learned for myself about working remotely. Your mileage may vary, but it’s what works well for me. Remote work has both pros and cons, and it might not be for everyone, but at this point, I’ve found I now actually prefer it to working in a traditional office.

Don’t work from home

Working remotely doesn’t mean you have to work from your couch or your bedroom. Especially if, like me, you live in a major city and have a small apartment, I’d strongly discourage it. Going to a coworking space every day, even if I’m the only one from my company working there, provides a mental transition in my day. When I’m there, I’m in work mode, and it’s easy to focus with no distractions. I’m not averse to working from home on occasion if I need to, but if I had to do it more than a day or two in a row, it’d be really stifling. (If you’re in New York or the tri-state area and are looking for coworking space recommendations, drop me a line – I’ve worked at a number of different ones over the years, and I can give you a good run-down of the pros/cons of each).

If you do decide to work from home, I would recommend a few things. First, set up a dedicated home office. Ideally, this is a separate room that you use only for work. At the very least, put up a curtain or folding partition to give some physical separation. Don’t use this space for anything else – if you’re a PC gamer, for example, put your gaming rig in a different room. If you live with others (roommates, significant others, kids, or pets), having a physical space that says “when I’m here, treat me like I’m at the office and I’m not at home” makes things easier.

Secondly, make sure you leave the house! Cabin fever is real. Give yourself reasons to leave the house at the beginning and end of your day. Going to the gym, getting coffee, walking the dog – make these part of your routine. One person I know would leave the house every morning, walk clockwise around the block, and call that his “commute” to the office. In the evenings, he’d repeat the process, but counter-clockwise. While it might seem a little artificial, it makes a difference!

Oh, and, if you’re ever being interviewed on live TV, make sure that the door locks.

The team matters

You don’t have to be on an all-remote team to have a good experience working remotely, but the rest of your team has to be on board with supporting remote-friendly workflows. That means making sure all meetings include video conference links so you can participate, and using asynchronous communication (email, JIRA, etc.) to document state that would otherwise only be exchanged verbally. In a pinch, even Twitter can work – whatever you do, just make sure that the rest of the team either already has remote-friendly habits or is willing to develop them. (Remote-friendly workflows are beneficial even on teams that are 100% co-located, so it’s to their benefit to be doing this anyway – but that’s the topic of another post!)

This applies to the company as well. If your company isn’t committed to remote-friendly workflows, or to remote work in general, you’ll have a much harder time. You don’t need the company to be 100% distributed – or even majority distributed. If you’re considering working remotely for a company, ask about how many other people work remotely, and how the company ensures that those people are integrated into their workflow. If there aren’t any, that’s fine – someone has to be first! – in which case you’ll want to ask why they’re looking to hire people remotely, and what parts of their company workflow, policies, and structure they anticipate having to modify to transition into a remote-friendly company. (There’s no specific “right” answer here; you’re looking to see what their planning process is like).

Timezones require some effort, but aren’t too hard

On my team, I’m the only person on Eastern time. One is on Central time, and the rest are on Pacific. Most engineers on other teams are also on Pacific time. At first glance, that sounds like it’d be hard – I’m 3 hours offset from most of the people I work with – but in practice, it’s not been too difficult after some adjustment. In practice, this divides my day in two: most of my meetings are after lunch, which leaves my mornings free to focus with uninterrupted time.

I still have approximately the same number of meetings per week as my teammates, so the downside is that this means my meetings are compressed into the afternoon hours. This can be tiring on a meeting-heavy day, but I’ve gotten accustomed to it.

The timezone difference is probably the biggest change between my first few months working remotely and my current experience. In the first couple of months at any job, you’re more likely to get stuck on something unfamiliar, or need to request access, or somehow involve another person in your work process. If you bump into one of these early in the day, it means you’ll need to wait until after lunch to get unblocked. The workaround is to make sure you have a few different projects in parallel that you can work on, so that getting blocked on one doesn’t kill your morning. Fortunately, this stage doesn’t last too long. Within a month or two, I found this happening less often – at least, no more often than I might get blocked on someone who worked in the same physical office.

Normally I work out of New York, but I spent a bit of time working from London and Singapore when I was traveling there for conferences. Working from London and Singapore was definitely harder, because there’s almost no overlap between normal working hours in California and London or Singapore. In both cases, it was short-term (less than a week), which meant that we could manage it with some careful planning (choosing projects that are encapsulated for that week). It would be harder to manage that longer-term without having at least some other engineers on my team working from an overlapping timezone.

Use a headset for video conferencing

For a long time, I used the built-in microphone on my laptop (along with headphones and an external webcam). Getting an all-in-one headset made a huge difference, both for me and my teammates. Between the latency and the (lack of) noise correction, using the built-in microphone meant a lot of slightly awkward pauses, or accidentally interrupting because the other person had started speaking but I couldn’t hear it yet. With an all-in-one headset, the microphone is much closer to the source (your voice), and it doesn’t pick up the sound from your headset speakers the way your laptop microphone does when you use it without headphones. This makes the noise correction work much better, giving the illusion of lower latency, so it feels a lot more like having a natural conversation. (I have the Jabra UC Voice, but I think any dedicated computer headset will work better than the built-in laptop microphone and speakers.)

The microphone on the other end matters as well. At our office, the microphones in the conference room have noise detection that’s eerily powerful. When I first started, it took some getting used to – I kept thinking that the audio on the other end had dropped, when in reality, nobody was speaking, and the background noise was getting filtered out! It’s not perfect, but it’s better than using Hangouts or Skype between two laptops.

Visit other coworkers often

Between Slack/IM and video conferencing, it’s easy to forget that I’m 3,000 miles away from my teammates. I’m good friends with all of them, and we chat a lot throughout the day. Seeing your coworkers regularly in-person is important.

The exact cadence depends on your personal situation and also your team layout – for a team that’s mostly distributed, you can probably get away with less frequent visits. For me, going to our headquarters every three months at a minimum helps me feel connected, not just to my teammates, but to coworkers on other teams that I work with less frequently.

At first, visiting the HQ office as a remote worker is a weird experience. I end up spending a lot of my time meeting with people, mostly in unstructured meetings (“coffee walks”, etc.). Compared to the rest of the time, I spend a lot less time writing code. The best way I’ve found to approach it is to remember that when I’m working remotely, I don’t have as much face time with my coworkers as they do with each other, and my visits to the office are my time to pay down that “debt”.

I think that one of the reasons working remotely works well for me is that my personal interest rate on this debt is low. I’m able to get most of what I need from email, slack, or video conferencing, and for the rest, I’m able to catch up quickly enough from in-person visits. On the flip side, context-switching is expensive enough that I’d rather group my “catch-up” time into a fixed period than interleave it throughout my normal daily routine.

This might sound tiring if you’re introverted – I’ll get to this later – but it doesn’t need to be. Don’t overdo it. Just do what feels natural, and don’t worry if it feels like you’re less productive when you’re visiting. Remember: productivity isn’t just measured by lines of code, and the time you spend building relationships with your coworkers is part of your work.

Also, note that I said “other coworkers”, not “other offices”! The tech lead for our team also works remotely, and we’ve met up in other cities as well – he happened to be speaking at a conference in New York (where I live), and we both went to a conference in Portland together. As I write this, I realize that he and I have actually never spent any appreciable amount of time together in the San Francisco office, but I’ve still never felt a lack of relatedness, because I still see and interact with him enough.

At the end of the day, you’re optimizing for relatedness – the feeling of being connected to other people and caring about them – not the number or frequency of your in-person interactions. Relatedness is “the universal want to interact, be connected to, and experience caring for others” – some people need this to happen in-person to satisfy their need for human interaction, but others may not. If you’re able to satisfy this for yourself through virtual interactions, you can probably get away with meeting in-person less frequently (or even not at all – I know a couple of people who are totally fine never meeting their coworkers in-person, but I think that’s rare).

Schedule recurring pair-programming meetings

I’d strongly recommend this for anyone, whether or not they’re working remotely, but it’s particularly useful for remote engineers. I have standing weekly pairing sessions with four people on my team. Ideally I could schedule a weekly pairing session with everyone, but there are only so many hours in the day!

There are a lot of benefits to regular, scheduled pairing (as opposed to ad-hoc pairing, or no pairing at all), but for now, I’ll just talk about the part that’s relevant to working remotely. When working remotely at a company that is not 100% remote, you have to make an active effort to make sure that your non-remote coworkers are aware of you, the work you’re doing, and your areas of expertise. Regular pairing sessions are a good forcing function for that: on any given week, whether you’re working on a project of yours or a project of theirs, you’re getting “face time” with your coworkers, and sharing knowledge about each others’ projects.

Introversion or extroversion

I’ve heard a lot of people argue very strongly that remote work is best for introverted people, because you spend most of your days alone. I’ve heard others argue that it’s best for extroverted people, because you have to spend more effort getting to know people across the company when you’re not physically present.

I don’t think it matters. I know people who work remotely with success who are introverted, extroverted, or ambiverted. People will adjust their work and life to match their personal styles wherever they are working, remotely or not. If you discover you need more social interaction, you’ll find ways to increase your interpersonal interaction with your coworkers, or to do that outside work. If you discover you need less, you’ll find ways to reduce it, by relying less on synchronous communication or adjusting your work schedule. Being remote isn’t only for introverts (or extroverts), the same way working in an office isn’t only for extroverts (or introverts).

If you’re working remotely (or considering working remotely), I’d love to talk more about this with you and exchange tips. Ask me on Twitter or drop me an email. Oh, and I just found out that we have a conference now too!

Thanks to Julia Evans and Karla Burnett for reading drafts of this post!