Fast Infrastructure
If you enjoy this content and want to
support it, go to
makeitwork.tv, join as a member,
and watch the full
conversation as a 4K movie.
You can stream it straight from the CDN
or from the Jellyfin media server.
It was summer, 5pm on a Saturday, and I
sent the following email to support
at namespace.so.
Hi, I would like to
debug a GitHub Actions workflow locally.
Is it possible to run the Namespace
managed Ubuntu container in Docker?
And 12 minutes later I received a reply.
Hi Gerhard, unfortunately we don't have
that possibility yet,
but it is something that we are working
on. What we often suggest to folks who
want to debug image-related issues is to
rely on the breakpoint action, which allows
you to stop the execution of a workflow
for debugging purposes. So where does
replying to customer support requests on
a weekend fit in your CEO role?
It's a great question and thanks for
actually reaching out because we love
working with developers and I think that
just boils down to that.
We care so much about offering great
support and we are engineers, developers
ourselves and many of these starting
points of projects also started at the
weekend. In fact, that's how Namespace
itself started. It was a weekend project.
So I have a lot of kind of a connection
with that and balancing it out with
regular life, but whenever we see a
request coming in that could
benefit from being unblocked, we try to
do that very quickly
because we care deeply about
offering great support and kind of from
an engineer to an
engineer. And that goes to
everyone in the company. It happened to
be me replying that
time, but it could have been
someone else in the team as well. That's
something that we try to embed as much as
possible in our company culture.
As first experiences go, that was a great one.
So thank you very much. And it set the
tone, I have to say,
and to this day, all my
interactions with Namespace have been like this.
Whenever there's a problem, I am confident
there's someone on the other end and I
will get the help that I
need. Oftentimes it's something that
I didn't know, so there's always
something to learn. I learned
about the breakpoint action,
for example. This is a very useful one.
And since then,
obviously, I've learned about the
NSC CLI and a couple of other things, but
it's all there and we all
meet as people and we're
all passionate about this thing because
who else, or how else could
you get this sort of interaction
5pm on a Saturday in the middle of the
summer when maybe you would be out and
about and doing things.
You know, I don't remember anymore, but
perhaps I was out and about.
It's possible, yes, you're on your phone
somewhere. And the
request came in and you're
the first one to pick it up. Okay, so I know
that you have a deep love
for all things infrastructure.
And this is something that I've learned
over the months that
we've been in contact and have
two related questions. How did this love
for infrastructure start
and how does it translate
to your day to day?
I've always been fascinated by how things work
and it's very hard to put my
finger on when did that start. But I
think it started early.
I think the earliest memory that I have
was in some very distant early
years. Christmas, I got a
present, a remote-controlled car.
And I'm old, so it's like very
rickety, early stages, remote-controlled
car. And one of the first
things that I did was open it up and see
how it was inside. So I
think it's just something
wired in my head that I have just some
curiosity to how things work.
And then over time, also how
complex things work and how they are the
sum of simple things composed
together to work in concert.
And then over time, not just technology,
but also people.
They also have their own sets of
complexities and they're also systems
at work. So I think it came down from
just a natural curiosity. I got involved
with technology a
little bit accidentally.
Actually, another interesting story. I
had my tonsils removed very
early on. I was four or five
years old and I have this distinct memory
of turning to the side
and seeing a screen which
was probably plotting either my heart
rate or plotting something.
And I started asking about
that screen because it's just in the haze
of going under for
surgery. It probably was the
thing that came to mind as a small kid.
And the nurse said,
"Don't worry, just calm down
and we'll show you everything about this
screen afterwards."
And they never did, but that was my
first connection with computers and the
idea of screens and things
that show up on the screen.
And a few years later, this was a time
when you would still buy
magazines that had printouts of code.
And that's how I got introduced
to the idea that you can actually
program these machines.
And then later on, I was lucky enough
when I was 12, I got my first
computer. But a couple years
before that, I had access to a school
where there was a computer
I could use. So I started
kind of playing around by myself. But
then when I got my first computer,
you start to explore, navigate,
eventually the internet becomes a thing.
My first connection to the internet was
actually with a dial-up modem.
But where I lived, you didn't
have an RJ11 plug. So we had like older
plugs with three prongs. I
actually don't even know what
kind of plug it was. And I was 14, and I
got the modem for
Christmas. And it came with this RJ11
on one side. I really wanted to use this,
but I didn't have anywhere
to plug it. And I thought,
"Well, there must just be electricity."
So I unplugged this
old-school plug from the wall at
my place. And I kind of tore apart the RJ11.
And I started trying
different combinations of cables,
which probably I shouldn't have. But
eventually, I got a
dial tone. And this magical
(mimics dial-up tone)
of the modem starting to dial out, which
I had heard before. And I
was like, "Wow, this is the
beginning of something." And the
internet... Yeah, that was the thing. Did
it ever happen for you to
receive a phone call while you were
messing with wires? Well, not messing
with the wires, because...
It's that moment when you're plugging the
wires in, because that was
my moment when I realized
I shouldn't be doing that. I did exactly
the same thing. And there
was a phone call coming in,
so you get a little bit of a shock. Not
too much. But I wasn't much older than
you. And I tried the
same thing. And I remember, "Okay, so
that's why you don't mess with wires
because they're live."
And, yeah, I mean, it's not like the
voltage is very low. I forget
exactly how much it is. It's
enough to feel it, to feel the phone
call. But that was my moment when I...
Same approach. Let's
figure this thing out. Let's wire them
together. And at the same
time as I was wiring them, there
was a phone call coming in. So I got a
bit of a shock. But
nothing happened apart from that.
You were shocked twice.
I was shocked twice, yes. Once for real:
wow, okay, I shouldn't be doing this.
So, yeah, it was no big deal.
I wasn't lucky enough to be shocked. But
it was very common that
either my mom would want to
dial out or someone would be dialing in
and it would interfere with
a connection. And there was
definitely a lot of drama around the fact
that you could not really use the line.
The beginning was two phone lines. So, I
had one friend that had
two phone lines for this very
purpose. I was like, "Oh, wow, he is
living the dream." Two phone lines. One
for internet and one
for like, you know, regular phone. Yeah.
And I had a friend that had ISDN at home.
Oh, that was just...
He was rich. He was one of
the rich kids. I can tell.
That's like he lived in a part of the
city where people couldn't afford ISDN.
Now to this day, we were talking about
this yesterday, your
connection is unheard of.
I think even for most people, like what
they have at home, can you tell us a
little bit about it,
about the connection
that you have currently?
So I live in Switzerland. And there's
this fantastic ISP here called Init7.
And they don't pay me
to say this. They're really
fantastic. So, they actually started many
years back when I
moved here 12 years ago.
When I moved here, I already had one
gigabit symmetric. So, up and
down. And fiber to the home.
But nowadays, they have 25 gigabit
symmetric to the home.
They're nerds as well. And it's a great
company for other nerds. Obviously, I
don't utilize the full 25
gigabit per second because
it's kind of overkill more so than
anything else. In the
office, we also have Init7. And
we do have 25. And there, sometimes we do
exercise the full 25.
But it's great. It's great.
Cannot complain.
That's amazing.
I can show you a quick demo if you want.
Yes, please. Let's see it. We have a
Chrome window here. Let's go to fast.com.
Oh, wow. That's not real. It's a bit
slow. Yeah, it is a bit
slow. No, that can't be right.
Can you try speedtest.net?
I can't believe fast.com
Oh, wow. 3.5 gigabits
per second. Yeah. Oh, wow.
So, a couple of things are
happening here. I'm on my Mac Studio. It has
a 10 gig ethernet connection. Then it
goes over to a 10
gig ethernet switch as well. And then I go
to our router that has a 25 gig port. But
I've done a few
changes. Part of my infrastructure at home is
fiber, set up by me:
I had a couple of racks downstairs and I
had it kind of connected
down with fiber. And I think I
damaged one of the fibers. So, I think
there is some loss. I haven't
measured it, but I used
to be able to get to eight gigabits on
my Mac. So, I think
there's actually a constraint
now, the signal is not as good. And I
haven't checked this.
Yeah, it's too slow. Right. 3.5 gigs
is too slow. I love that. Like only a
nerd would say that like,
Hey, I'm like pushing almost four
gigs per second, both up and down, but
it's too slow. This could
go faster. That's amazing.
Sometimes you want to upload something
and it's... Right. well, I
think the problem that you will
see with this, and I'm sure you have hit
it a couple of times: wherever
you're uploading to, sometimes they can't
accept more than one
gigabit per second. So, sometimes
they're limited, you know, on their end,
because they don't expect
users to have this type of
setup. But that's very nice. Very, very
nice. Okay. Where I have really seen it:
for many years now, I haven't really played
any games, but
many years ago, I used to play
quite a bit of Blizzard
games, so World of Warcraft,
Starcraft, and they have an installer
that internally uses,
it might even be BitTorrent,
but something like BitTorrent, at least.
So you can really get like
multi stream. Yeah. And that's
just incredible. Like you can easily use
your whole link because
you just be able to pull from
multiple sources. So for things like
that, it's, it's really you
can really tell a difference.
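That multi-stream trick, pulling different pieces of the same file from several sources at once, is roughly how a BitTorrent-style installer fills a fat pipe. Here is a minimal Python sketch; the "mirrors" are faked as slices of an in-memory payload, and every name here is illustrative, not any real installer's API:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in payload (~1 MiB). In reality each "mirror" would be a
# separate peer or CDN node serving HTTP Range requests.
FILE = bytes(range(256)) * 4096

def fetch_range(mirror_id: int, start: int, end: int) -> tuple[int, bytes]:
    # A real client would issue a ranged request to mirror `mirror_id`;
    # here we just slice the in-memory payload.
    return start, FILE[start:end]

def parallel_download(chunk_size: int = 64 * 1024, workers: int = 8) -> bytes:
    ranges = [(i, min(i + chunk_size, len(FILE)))
              for i in range(0, len(FILE), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Round-robin the chunks across 4 pretend mirrors so that
        # several streams are in flight at once.
        parts = list(pool.map(
            lambda r: fetch_range(r[0] // chunk_size % 4, *r), ranges))
    # Reassemble in offset order, regardless of completion order.
    return b"".join(data for _, data in sorted(parts))

assert parallel_download() == FILE
```

The reason this saturates a big link where a single stream doesn't is that each stream is individually limited (by the server's per-connection cap or by TCP dynamics), but the limits add up across streams.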
So how does all this love that you
have for infrastructure, for networks, for, you know,
fast things, translate to Namespace? We,
first and foremost, build
something for ourselves. Well,
the origin of Namespace Labs, and the
name, is that we were going
to build an infrastructure
company that focuses on software-defined
storage, because it was kind
of a big thing that both me
and another person who is not here, HDR,
have a passion for. But
as we were building it out,
we kind of found a few
challenges along the way.
And then we moved over to build
an application platform. And as we were
doing that, we wanted to
run a lot of tests in parallel
very quickly, because we didn't want to
wait minutes for an EKS
cluster to be created, or even
a GKE cluster, any
Kubernetes cluster.
I put together
something that kind of cut through all
the layers and just focus on the
essential to start a
Kubernetes cluster really, really
quickly, because we wanted to run many of
them in full isolation
to test foundation to test this
application platform. So
that was the genesis. And it was
really for us, because we are developers
ourselves, actually,
majority of the company
is engineers, and we have an appreciation
for infrastructure
that works well, that is
understandable, and that is fast.
So that's something that we try to
project into the products
that we build. And many of the
things that we do at Namespace, one of
our product principles
is fast is a feature. So we try to spend
quite a bit of energy on
making things as fast as possible.
Yeah. When you say Kubernetes clusters
that spin up fast, or very fast, what
does that mean to you, very fast?
It had to be seconds, like that's what
made sense. But it wasn't
just an arbitrary "we need to,
this should be seconds." It came from
the source of, can we know
how things work? We know
how long Linux takes to boot up, the
kernel to start; we know
how long it takes to scan
devices, we know how long it takes to
mount a file system, we know how long it
takes to start a process
we know. If you kind of add all
of those things up, you get to a point
where you start
questioning: why does it take minutes?
And so there's kind of
inefficiencies in the system.
And even today, the Kubernetes API
server is fantastic, but
it has a few things built in
that are not level-triggered. So
there are waiting
periods. We wanted to make
it faster, but to make it even
faster, we would
have to go and change the
implementation. So it really came down to
how fast we think this
should be. And I did some
back-of-the-envelope calculations.
And I said it shouldn't
take more than 10 seconds to
start a single node Kubernetes cluster.
So that was the starting point.
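The back-of-the-envelope exercise he describes can be written down directly. The per-stage timings below are made-up placeholders, not Namespace's measured numbers; the point is only that the known boot stages sum to seconds, not minutes:

```python
# Illustrative (invented) timings, in seconds, for cold-starting an
# isolated single-node Kubernetes cluster in a fresh virtual machine.
stages = {
    "kernel boot":       1.0,
    "device scan":       0.5,
    "mount root fs":     0.3,
    "start containerd":  1.0,
    "start kubelet":     1.5,
    "api server ready":  2.0,
    "node registration": 1.0,
}

total = sum(stages.values())
# If the known stages add up to well under 10 seconds, anything that
# takes minutes is overhead in the layers above, not physics.
assert total < 10
print(f"estimated cold start: {total:.1f}s")
```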
Creating Kubernetes clusters
from scratch, fully isolated, so not, you
know, a pod running in
another Kubernetes cluster,
but rather like a virtual machine where
you have access to the
kernel, your own kernel;
you have your own rootfs, so you
can decide what gets
packaged into it. And then that
allowed us to start kind of running more
tests faster, but the
main thing was the fan-out:
we wanted to run many in parallel.
Yeah. So just to have a better
understanding of the scale
that we're talking about, and I'm just
looking for a magnitude,
are we talking thousands of
Kubernetes clusters? Are we talking 10s
of 1000s, hundreds of 1000s?
Like, how much are we talking
about in like, what period of time as
well, just for listeners to have an
appreciation of the scale
that this operates at?
Thinking about Namespace, we do many
millions of runs over
a short period of time. So that's kind of
the scale that we're operating at.
And every single
instance is fully unique. So it's
completely new virtual machine.
Everything gets started from
from scratch, the network gets programmed
dynamically for that
instance, there's like
everything is from scratch. But when we
started, like our target
was to run like 100 Kubernetes
clusters in parallel. So you can see
the humble start, and
now we have customers that
start a high magnitude of
concurrent jobs. And
that's even
one of our biggest challenges nowadays is
supporting that type
of performance. So very
low latency creation at a very high
concurrency. We have tenants, that's
kind of the unit in
our system, that create 1000s of
jobs in an extremely small
period of time. And those
run over many, many, many machines. But
even today, if you go to a
Kubernetes cluster, and you
start 1000 pods, which is kind of the
quick test if you define some
kind of equivalence with
another system: you'll see how long it
takes for those pods to be created.
Because first, you need
to commit the state; the
scheduler needs to
decide on which machine they're
going to run; then you
need to have IPAM, so you
need to have an IP address
assigned to the pod. So there's kind of
many things that need to
happen. And as you scale out
the concurrency, you hit serialization
limits, because some of
these they need to be, you need
to have like a consistent view of the
universe to be able to make
a decision, like you cannot
assign the same IP address to two pods.
So you need to have some sort of
serialization. Yeah.
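The IP-assignment constraint he mentions is a classic serialization point: two pods must never get the same address, so somewhere a single consistent view has to be consulted. A toy Python illustration of that bottleneck (not Kubernetes' or Namespace's actual IPAM implementation):

```python
import threading
from ipaddress import IPv4Network

class Ipam:
    """Toy IP allocator: handing out addresses must be serialized,
    because two pods can never share the same IP."""
    def __init__(self, cidr: str):
        self._free = list(IPv4Network(cidr).hosts())
        self._lock = threading.Lock()  # the serialization point

    def allocate(self):
        with self._lock:  # a consistent view of the universe
            return self._free.pop(0)

ipam = Ipam("10.0.0.0/24")
assigned = []

def start_pod():
    assigned.append(ipam.allocate())

# 100 "pods" created concurrently all funnel through the same lock.
threads = [threading.Thread(target=start_pod) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every pod got a unique address; the lock is what caps concurrency.
assert len(set(assigned)) == 100
```

No matter how many machines you add, this one decision stays sequential, which is why raw concurrency eventually hits serialization limits.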
And so those are the types of
challenges that we're tackling today.
This is a little bit of an aside,
but when you have a natural
partitioning scheme, like two customers,
for example, scaling
across customers is a little bit easier,
because you can
partition your infrastructure.
But when you go inside one customer, then
things start to become a
little bit more challenging.
And those are the types of scaling
challenges that we have today.
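One way to picture a "natural partitioning scheme" is a deterministic tenant-to-partition mapping. This is a hypothetical sketch, not Namespace's actual placement logic; the partition names and the hashing choice are invented for illustration:

```python
import hashlib

# Hypothetical infrastructure slices a tenant could be pinned to.
PARTITIONS = ["partition-a", "partition-b", "partition-c", "partition-d"]

def partition_for(tenant_id: str) -> str:
    """Deterministically spread tenants across partitions, so load
    from different customers lands on different slices of hardware."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return PARTITIONS[int.from_bytes(digest[:8], "big") % len(PARTITIONS)]

# The same tenant always maps to the same partition...
assert partition_for("acme") == partition_for("acme")
# ...which is exactly why scaling *within* one huge tenant is the
# harder problem: its whole load lands on a single slice.
```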
Yeah, especially the big
customers that you mentioned that
start a lot of jobs at once. And a job,
what does a job mean?
Like, what does it translate to
in infrastructure terms? Are we talking
containers, virtual machines, how many
CPUs, how much memory,
like, what does the job look like? The
unit of compute in our
world is an instance. But that
instance is a combination of a virtual
machine. So you get full access to the
kernel and everything
in that virtual machine. But it's an
environment that is designed to run
containers. So it's not that
you get an Ubuntu virtual machine,
and then you go and, you
know, deploy systemd units;
that's not how we think about the
problem. We approach it
from: we use containers as a
distribution mechanism. So you define
your application,
whatever you want to run, you
encapsulate that in a container because
it has all of the software
that you need. It also tells
us how to start it; it has a few other
properties. And we place that
container or multiple containers
in a virtual machine that can use
an arbitrary set of
resources. So you can decide
whether it uses two or 16 CPUs, or
whether it uses, you know, two or 256
gigs of RAM. So you have like
full flexibility on that. And then also
from a network
perspective, like if you want to
interact with whatever is running in that
in that instance, you get a few
management properties out
of the box, like you can SSH in, you
don't need to configure anything.
But if you want to access
the service that you have, then you also
have primitives for that too, to
kind of program an ingress.
We say jobs, because we kind
of approach the problem in a layered way.
We think of the compute platform,
which is a little bit more
generic, as one thing, and then
applications built on top as
something separate. And
a lot of our customers,
they use Namespace to run jobs. And so
those jobs are usually something that
starts, has a purpose,
wants to go really fast, that's usually
the case. And then it ends.
And it could be a GitHub job,
it could be a Buildkite job, it could
be a GitLab job, it could be a CircleCI
job. But it can also
be your custom job, like that you want to
run a system test. So for
example, we have customers that
deploy system tests on instances.
And they can rely
on something that scales out without
being constrained by
whatever resources that they
have available in the environment where
they started. I think
how people deal with adversity
is very telling. When something fails,
especially when it fails,
how do you handle that tells
everything about you at many levels, as
an individual, as a
team, as a company.
And the reason why I say that is because I know
that you had a major outage
this year. And it was one of
the things that you don't expect will
happen. You prepare for it. And when it
happens, you're like,
wow, I'm so glad we had some
preparations. But it's very difficult to
simulate that. It's very
difficult to fire drill that.
It's really, really hard. So can
you tell us more about what
happened? And how did you respond?
We've been running Namespace now
for some time, so for
close to two years. And we've had our
challenges along the way. But
nothing as big as this more
recent outage. A couple interesting
things there. Like, we had two issues
that happened at the same time.
And I'm lucky that our team is
experienced, and we've operated and kind
of supported and built
large-scale systems over
our years before
Namespace. So that gives us a little
bit of preparation, like
how can things fail? That's
very often how we approach
building something. It's not just the
functionality that it has;
something that is part of our
conversation is, what are the failure
modes? What if you have an
application, it's stateless,
but it pushes some state into some
database? Well, what happens
if you don't have access to that
database? What happens if you have
multiple requests going
concurrently, and you compromise
on the serializability of your
transactions? Like, how does your
application react to
potentially inconsistent states that you
had to accept for other reasons?
So we try to incorporate as
much as possible a failure-mode view
into how we approach
features. This big outage that we had,
it was kind of a combination of two
things. Namespace, when
we started, and we used
exclusively hardware provided by others.
So actually, we started
with bare metal in AWS,
and then we switched over to Equinix
Metal, or Packet. And then
we kind of worked with other
providers over time. And fairly early on,
it became obvious to us
that in order to offer
a great product that had an emphasis on
performance, we had to
have a lot more control
over the hardware, not just individual
servers, but also the
layout of the rack. So how much
network capacity there is? Do we know
that one compute node is
next to another compute node?
Is it in the same switch or not? So all
of those things started to
play a role in how we approached
our development. And we couldn't find a good
mix that would give us both the global
reach that we needed,
because we have some customers that want
to run workloads in North
America. We have customers that
run workloads in Europe. And we realized,
well, we have to do it
ourselves. So, Namespace deploys
its own hardware, and software stack on
top of that hardware. So
that means we decide everything
from CPU, RAM, how much storage, how much
networking, what's
the layout of the rack,
how our racks are laid out, the spine-leaf
setup, how that is; all
of that is done internally.
And we set ourselves on a journey to
move completely to our
own hardware. And we've been
playing catch-up for some time.
We had
in October, a major expansion of one of
our sites coming in, where the
distributor that we work with,
they made a mistake in their order, and
they ordered the wrong
DIMMs for those servers.
And it's a lot of DIMMs. It's not just,
you know, 20 DIMMs or 30 DIMMs that you
can go to a shop and
get. It's actually so many that they had
to go and order directly
from the source. And that added
three weeks more to that delivery. We
were counting on that
hardware, because we knew that
we were already running quite hot. So
quite hot, as in like our utilization is
high. So part of the
reason why we were okay with that is
because we have tools that
allow us to manage utilization
across sites. We can run
continuous optimizations
where we try to maintain
each site kind of hot enough, but not
more than that. So we can
kind of move things around.
but globally, because of that missed
delivery, we were running quite hot. At
the same time, one of our
existing deployments in a company that
offers kind of bare metal
that we used, they started having
an issue in their network product, which
we use to connect multiple servers
together into a single
layer two segment, where it led to
sporadic packet loss. And at first, well,
the internet is built on
sporadic packet loss, so things just kind
of work. But as that became
worse over time, it was so bad
that it had a real impact on our
customers. And we interacted with that
vendor, and for various
reasons, they kind of acknowledged, but
they didn't react quickly enough to the
problem. So we decided
that that wasn't acceptable for the level of
service that we're offering to our
customers; the fact that
we were a source of flakes, because of
that kind of random packet
loss, was not acceptable. So
we strategized and we made a decision on
changing our network setup
so that we wouldn't depend on
that particular feature. That meant
though, that we had a dip in our
capacity, because we had to
redeploy that part of our infrastructure.
That has to do with the
fact that we run immutable,
we try to be very immutable. So as
machines move from one setup to another
setup, they need to be
re-set up to get new keys. So there's
something else that
kind of plays a role there.
And that took some time, we had practiced
that, but it took longer
than we anticipated. One of
the challenges was we rely a lot on state
that lives on individual
machines, and not on the
network, to enable fast performance,
bootups, etc. And distributing that
state, because we had a much
bigger fleet versus what we had done
before, for that particular region took
longer than we expected.
So it's kind of distributing all of the
state across all machines, it
highlighted a few bottlenecks
that we had. And that build up took some
time. So we were running
really, really hot for some time,
we had part of the team just trying to
support our customers,
making decisions on, okay, we're
going to move this customer to this part,
because now it's actually their peak
time, and we want to
make sure that they get as good an
experience as possible. So
there was kind of part of the team
that was just trying to offer as good of
support to our customers as possible,
where the other part
of the team was just kind of rebuilding
the region. And we did it,
but it was extremely taxing.
It's primarily because we feel such a
strong commitment to the
services that we offer,
because we've experienced it:
there's something that we
depend on as developers,
and then it's not working. It's just the
worst, right? I cannot do my job.
So I think emotionally, it was extremely
taxing. I look at it a
lot from the human side,
like you're trying to do a
great job, you're trying
to provide a great service to your
customers, but then we let them down in
that particular moment.
And we tried to be transparent about it.
We wrote a postmortem.
The things that I mentioned
and more are there. We learned a lot in
that experience. And to
be honest with you, we were
expecting that some customers would come
to us and say that this is unacceptable
and we're moving on.
But not a single customer left due to
that outage. And we
actually got a lot of support,
and I have a big appreciation for our
customers. I met one of our
customers in San Francisco
a few weeks after the outage, and they
said, "Yeah, it was 3
p.m." And we decided, "Okay,
we're going to call it a day now, because
it seems like our jobs are not running.
But we're so happy with the service that
you folks usually provide
to us that it was one day,
and that's okay." But yeah, it felt very bad.
It wasn't a complete outage, right? It was a
degradation, a significant degradation,
but not every customer was
impacted. So this was limited
to one region. That was the blast radius,
and you have multiple
regions. So that's one. The second
one is that not all customers were
impacted the same amount, right? Because
as this was happening,
you're also moving customers off, which
I, you know, that's
something which I missed. And I will
go back to the post-mortem; by the way, I
will add a link to the show notes.
The way you handled something failing, and
something failing in a very
significant way, right? The whole region
going away, or being unusable.
You were able to
be hands-on, you understood how
all the pieces fit together, which means
that you were able to do
something about it, rather than
putting your hands up in the air and
saying, "Hey, it's the provider, we can't
do anything about that."
Think about what happens when you're in
AWS, or GCP, or Azure, or,
you know, one of those big
providers, what can you do? And you say,
"Well, I'm going to move things off it."
You may have so much
stuff to move off that you can't move it
off, not to mention that if
there's an outage, how are you
going to move stuff off? Especially if
the DNS is there, you can't get to the
DNS, you can't update
it. And this happens, you know, for many
companies, and many, many
businesses. And at the end of the
day, we are humans. The internet does go
down, or at least half of
it. I remember when Fastly went
down, they had an outage, or CloudFlare
went down, or Facebook
starts, you know, the BGP routes get
all messed up. Now that's bad. But in all
of this, there's always
something to learn. There's
always something to improve. And the best
approach is to get
better on the other side.
You mentioned something that I think is
really important. There's a
particular company that
we work with, and one
of their products wasn't
working to spec. But from my
perspective, that's on us. We decide who
are the companies that we
work with. And I don't even want
to throw them under the bus. I
think perhaps other
customers didn't have the same
issue. Perhaps it was the way that we
were using their
infrastructure that led to that. And it's
really on us. And obviously, when we work
with someone, and they
can provide us great support
that helps us get the resolution faster.
Great. But in that case,
I felt, and the whole
team felt, a commitment to our
customers. And it doesn't
matter if it was a
delayed delivery, or if it was a
particular upstream, or if it's a
particular provider,
that's really on us. And we felt
that, okay, we need to do something,
we're not just going to say,
you know, this is not usable. We
need to do something to get back to
serving our customers.
I think a lot of props should go to the
team. But I think our
ability to handle some of these
situations is, if you
find yourself in a situation and it's
new, and you're not prepared,
it's going to be much harder.
So the more you prepare, both "hey, this
disaster scenario is
possible," and maybe
you don't even run an exercise, maybe you
just talk about "here's
what we would do," so that you
just have a shared understanding of what
are the tools that would be
available to us if something
like that happens. I think that's already
the first level of
preparation. And then it goes
to the architecture as well. We try to
present a global service to
you so that you don't have to
think about regions and capacity and all
of that. But behind the
scenes, there is partitioning,
both for performance reasons, but also
for reliability reasons. And that
design principle also allowed us to
continue to serve our customers, even at
a very degraded state
while recovery was going on. Because from
their perspective, things
got slower, because we just
didn't have enough capacity for all
of their jobs. So how does Namespace come together for a user like myself, or for regular users? I'm very keen to see what it means when we use Namespace on the command line. How does
it compare with the stuff that you may have running locally? Because sure, you can run things locally, but not everyone has 25 gigabits at home. And even then, you want the resilience. So I'm wondering if
there's something we can screen share, something we can look at, just to see how this comes together in practice. Yeah, I'm
happy to show you a couple of things. I would preface, though, that we're very pragmatic. There are certain parts of your developer workflow, certain things that you want to do, where running them locally will always be the right thing.
We really like to think about what's the right tool for a particular job. Where we try to excel is scale-out. You can get things running really well on your machine, but now you want to run 1,000 of them. You could try to find 1,000 machines to run them at home, but that's probably not the best approach, right? So we try to apply ourselves where we can provide a non-trivial amount of value, where it makes sense to move over from whatever else you're doing, whether it's local development or something else.
With Namespace, there are different ways to approach the product. We're an infrastructure provider; that's where the nsc CLI comes in. But we're also a service provider, and I would say the majority of our customers use our prepackaged solutions. They want to do Docker builds, and they want those builds to be as fast as possible, so we have a prepackaged Docker build product. They want to run a Kubernetes cluster really fast, or 100 or 1,000 or 10,000 of them; we have a prepackaged product for that. They want their CI runs to go really fast, or in a very cost-effective way; actually, we're starting to hear a lot more around cost management, so we have products for that as well. We'll focus today a bit more on the infrastructure, so under the covers. But if you want your CI to go fast, you don't actually have to run the nsc CLI; there are products that make that super easy for you.
I was thinking of starting at the origin. So, things started with building this application framework. And for this application framework, we leaned on something akin to what we had built at Google, which is a platform called Boq. It tackles how you write services, how services talk with each other, how you build services, how you test services, how you deploy services, and how you observe services in production.
To see how Namespace tests Foundation,
the open source application platform
inspired by Google's Boq,
find the YouTube video
link in the show notes.
After Hugo's demo, we look into how a
remote Docker build can
be faster than a local one.
That is a separate YouTube video,
link in the show notes.
OK, let's start wrapping this episode up.
We see more and more use cases around complex scenarios with previews. This is an area that has also been a pain point for us, and we want to do better. Another thing: instances right now are fully isolated, so two instances don't share any networking.
We have a POC internally that uses Tailscale, where you can connect multiple instances. But we're also thinking of just adding a tagged mode, where you tag an instance with a network, and then instances tagged with the same network can communicate with each other. So you could have a front end that calls a back end.
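As a toy sketch of the tagged-network idea Hugo describes (purely illustrative; this is not Namespace's actual API, and all names here are made up), reachability between instances reduces to a shared-tag check:

```python
# Toy model of tag-based instance networking -- illustrative only,
# not Namespace's real API or data model.

from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    network_tags: set[str] = field(default_factory=set)


def can_communicate(a: Instance, b: Instance) -> bool:
    """Two instances can talk iff they share at least one network tag."""
    return bool(a.network_tags & b.network_tags)


frontend = Instance("frontend", {"preview-123"})
backend = Instance("backend", {"preview-123"})
isolated = Instance("isolated")  # no tags: fully isolated, today's default

print(can_communicate(frontend, backend))   # frontend can call backend
print(can_communicate(frontend, isolated))  # untagged instance stays isolated
```

The point of the sketch is the design choice: connectivity becomes a property of declarative tags rather than of manually wired networking between instances.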
Our goal is not to cover all of the possible compute use cases, but just things that are helpful and ideally easy to use to achieve what you want to achieve.
Typically, for creating a preview, you can go all the way to a PaaS, a solution that just packages everything, and then you have very little flexibility; or you can go, "Okay, I need to do everything from scratch."
And we try to be somewhere in the middle
where you have kind of
building blocks that are
helpful but you still
can make it your own.
You can still decide: what goes inside my container? Is it multiple containers? Do I want authentication?
So all of that reflects our mental model, our design principle of being somewhere in the middle: not fully packaged, but also not completely from scratch.
As we prepare to wrap up: one last thought, one last takeaway for people that stuck with us to the end. What would you like them to take away from our conversation?
I was asked recently,
how does one become good at something?
And I've worked with so many engineers
that are extremely good.
And I've been looking for patterns,
like what are the things that are common
across these engineers?
And I find that it's usually some kind of
unrelenting curiosity
that really propels people beyond just
being good to being excellent.
And I think that kind of comes back to
when we approach how
we build our products
is with that same level
of unrelenting curiosity
and willingness to break
through and change things
that may help us build a better product.
And I think having that courage has been helpful for us. But when we bring people in, we try to instill that same spirit of: just go deep, read the code, try different things, see how it works. That really helps propel us to do better.
Well, on that note, thank you very much
for joining us today, Hugo.
I look forward to all the improvements
you'll be driving in Namespace.
I think you're on to something here.
I really like the speed, I really like
the simplicity in many ways,
and I know that behind it, there's a lot
of complexity that you need to handle
to make things this simple and this fast.
Thank you very much.
Thank you.
And I look forward to the next one.
It was a pleasure to be here. Thank you.
