Fast Infrastructure


If you enjoy this content and want to

support it, go to

makeitwork.tv, join as a member,

and watch the full

conversation as a 4K movie.

You can stream it straight from the CDN

or from the Jellyfin media server.

It was summer, 5pm on a Saturday, and I

sent the following email to support

at namespace.so.

Hi, I would like to

debug a GitHub Actions workflow locally.

Is it possible to run the Namespace

managed Ubuntu container in Docker?

And 12 minutes later I received a reply.

Hi Gerhard, unfortunately we don't have

that possibility yet,

but it is something that we are working

on. What we often suggest to folks who

want to debug image related issues is to

rely on breakpoint action, which allows

you to stop the execution of a workflow

for debugging purposes. So where does

replying to customer support requests on

a weekend fit in your CEO role?

It's a great question and thanks for

actually reaching out because we love

working with developers and I think that

just boils down to that.

We care so much about offering great

support and we are engineers, developers

ourselves and many of these starting

points of projects also started at the

weekend. In fact, that's how Namespace

itself started. It was a weekend project.

So I have a lot of kind of a connection

with that and balancing it out with

regular life, but whenever we see a

request coming in that could

benefit from being unblocked, we try to

do that very quickly

because we care deeply about

offering great support and kind of from

an engineer to an

engineer. And that goes to

everyone in the company. It happened to

be me replying that

time, but it could have been

someone else in the team as well. That's

something that we try to embed as much as

possible in our company culture.

As first experiences go, that was a great one.

So thank you very much. And it set the

tone, I have to say,

and to this day, all my

interactions with Namespace have been like this.

Whenever there's a problem, I am confident

there's someone on the other end and I

will get the help that I

need. Oftentimes something that

I didn't know. So always there's

something to learn. I know

about the breakpoint action,

for example. This is a very useful one.
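The breakpoint action mentioned here pauses a workflow run so you can attach to the runner and debug interactively. A rough sketch of how such a step might be wired in, where the action coordinates, version, runner label, and inputs are illustrative assumptions rather than verified values:

```yaml
name: ci
on: push

jobs:
  build:
    # Runner label is illustrative; Namespace-managed runners use their own labels.
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: make test
      # Pause the job on failure so you can SSH in and inspect the environment.
      # Action name, version, and inputs below are assumptions, not verified.
      - name: Breakpoint on failure
        if: failure()
        uses: namespacelabs/breakpoint-action@v0
        with:
          duration: 30m
          authorized-users: your-github-handle
```

While paused, a tool like this would print a connection endpoint into the job log; the run resumes (and fails normally) once the breakpoint window expires.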

And since then,

obviously, I've learned about the

NSC CLI and a couple of other things, but

it's all there and we all

meet as people and we're

all passionate about this thing because

who else, or how else could

you get this sort of interaction

5pm on a Saturday in the middle of the

summer when maybe you would be out and

about and doing things.

You know, I don't remember anymore, but

perhaps I was out and about.

It's possible, yes, you're on your phone

somewhere. And the

request came in and you're

the first one to pick it. Okay, so I know

that you have a deep love

for all things infrastructure.

And this is something that I've learned

over the months that

we've been in contact and have

two related questions. How did this love

for infrastructure start

and how does it translate

to your day to day?

I've always been fascinated by how things work

and it's very hard to put my

finger on when did that start. But I

think it started early.

I think the earliest memory that I have

was in some very distant early

years. Christmas, I got a

present, a remote-controlled car.

And I'm old, so it's like very

rickety, early stages, remote-controlled

car. And one of the first

things that I did was open it up and see

how it was inside. So I

think it's just something

wired in my head that I have just some

curiosity to how things work.

And then over time, also how

complex things work and how they are the

sum of simple things composed

together to work in concert.

And then over time, not just technology,

but also people.

They also have their own sets of

complexities and they're also systems

at work. So I think it came down from

just a natural curiosity. I got involved

with technology a

little bit accidentally.

Maybe another,

Actually, another interesting story. I

had my tonsils removed very

early on. I was four or five

years old and I have this distinct memory

of turning to the side

and seeing a screen which

was probably plotting either my heart

rate or plotting something.

And I started asking about

that screen because it's just in the haze

of going under for

surgery. It probably was the

thing that came to mind as a small kid.

And the nurse said,

"Don't worry, just calm down

and we'll show you everything about this

screen afterwards."

And they never did, but that was my

first connection with computers and the

idea of screens and things

that show up on the screen.

And a few years later, started at a time

where you would still buy

magazines that had printouts of code.

And then it's how I got introduced

into the idea that you can actually

program these machines.

And then later on, I was lucky enough

when I was 12, I got my first

computer. But a couple years

before that, I had access to a school

where there was a computer

I could use. So I started

kind of playing around by myself. But

then when I got my first computer,

you start to explore, navigate,

eventually the internet becomes a thing.

My first connection to the internet was

actually with a dial-up modem.

But where I lived, you didn't

have an RJ11 plug. So we had like older

plugs with three prongs. I

actually don't even know what

kind of plug it was. And I was 14, and I

got the modem for

Christmas. And it came with this RJ11

on one side. I really want to use this,

but I don't have anywhere

to plug it. And I thought,

"Well, there must just be electricity."

So I unplugged this

old-school plug from the wall at

my place. And I kind of tear apart RJ11.

And I start trying

different combinations of cables,

which probably I shouldn't have. But

eventually, I got a

dial tone. And this magical

(mimics dial-up tone)

of the modem starting to dial out, which

I had heard before. And I

was like, "Wow, this is the

beginning of something." And the

internet... Yeah, that was the thing. Did

it ever happen for you to

receive a phone call while you were

messing with wires? Well, not messing

with the wires, because...

It's that moment when you're plugging the

wires in, because that was

my moment when I realized

I shouldn't be doing that. I did exactly

the same thing. And there

was a phone call coming in,

so you get a little bit of a shock. Not

too much. But I wasn't much older than

you. And I tried the

same thing. And I remember, "Okay, so

that's why you don't mess with wires

because they're live."

And, yeah, I mean, it's not like the

voltage is very low. I forget

exactly how much it is. It's

enough to feel it, to feel the phone

call. But that was my moment when I...

Same approach. Let's

figure this thing out. Let's wire them

together. And at the same

time as I was wiring them, there

was a phone call coming in. So it got a

bit of a shock. But

nothing happened apart from that.

You were shocked twice.

I was shocked twice, yes. Once for real?

Wow, okay, I shouldn't be doing this.

So, yeah, it was no big deal.

I wasn't lucky enough to be shocked. But

it was very common that

either my mom would want to

dial out or someone would be dialing in

and it would interfere with

a connection. And that was

definitely a lot of drama around the fact

that you cannot really utilize the line.

The beginning was two phone lines. So, I

had one friend that had

two phone lines for this very

purpose. I was like, "Oh, wow, he is

living the dream." Two phone lines. One

for internet and one

for like, you know, regular phone. Yeah.

And I had a friend that had ISDN at home.

Oh, that was just...

He was rich. He was one of

the rich kids. I can tell.

That's like he lived in a part of the

city where people couldn't afford ISDN.

Now to this day, we were talking about

this yesterday, your

connection is unheard of.

I think even for most people, like what

they have at home, can you tell us a

little bit about it,

about the connection

that you have currently?

So I live in Switzerland. And there's

this fantastic ISP here called Init7.

And they don't pay me

to say this. They're really

fantastic. So, they actually started many

years back when I

moved here 12 years ago.

When I moved here, I already had one

gigabit symmetric. So, up and

down. And fiber to the home.

But nowadays, they have 25 gigabit

symmetric to the home.

They're nerds as well. And it's a great

company for other nerds. Obviously, I

don't utilize the full 25

gigabit per second because

it's kind of overkill more than

anything else. In the

office, we also have Init7. And

we do have 25. And there, sometimes we do

exercise the full 25.

But it's great. It's great.

Cannot complain.

That's amazing.

I can show you a quick demo if you want.

Yes, please. Let's see it. We have a

Chrome window here. Let's go to fast.com.

Oh, wow. That's not real. It's a bit

slow. Yeah, it is a bit

slow. No, that can't be right.

Can you try speedtest.net?

I can't believe fast.com.

Oh, wow. 3.5 gigabits

per second. Yeah. Oh, wow.

So, a couple of things are

happening here. So,

I'm on my Mac Studio. It has

a 10 gig ethernet connection. Then it

goes over to an ethernet 10

gig switch as well. And then I go

to our router that has a 25 gig port. But

I actually, because I've done a few

changes, I used to have

so a part of my infrastructure at home is

fiber, set up by me,

and I used to have it like

I had a couple of racks downstairs and I

had it kind of connected

down with fiber. And I think I

damaged one of the fibers. So, I think

there is some loss. I haven't

measured it, but I used

to be able to get eight gigabits on

my Mac. So, I think

there's actually a constraint

now that the signal is not as good. And I

haven't checked this.

Yeah, it's too slow. Right. 3.5 gigs

is too slow. I love that. Like only a

nerd would say that like,

Hey, I'm like pushing almost four

gigs per second, both up and down, but

it's too slow. This could

go faster. That's amazing.

Sometimes you want to upload something

and it's... Right. well, I

think the problem that you will

see with this, and I'm sure you have hit

it a couple of times, that wherever

you're uploading to, sometimes they can't

accept more than one

gigabit per second. So, sometimes

they're limited, you know, on their end,

because they don't expect

users to have this type of

setup. But that's very nice. Very, very

nice. Okay. Where I have really seen it:

nowadays, I haven't played

any games for many years, but

many years ago, I used to play

quite a bit of Blizzard

games, so World of Warcraft,

Starcraft, and they have an installer

that internally uses,

it might even be BitTorrent,

but something like BitTorrent, at least.

So you can really get like

multi stream. Yeah. And that's

just incredible. Like you can easily use

your whole link because

you're able to pull from

multiple sources. So for things like

that, it's, it's really you

can really tell a difference.
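The multi-stream idea behind that installer can be sketched with plain HTTP Range requests: split the file into byte ranges and fetch them in parallel, so one TCP stream's throughput no longer caps the link. This is a minimal sketch, not what Blizzard's installer actually does; the URL is a placeholder and it assumes the server honors `Range` headers.

```python
# Sketch: download a file over several parallel HTTP Range requests.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def split_ranges(size: int, parts: int):
    """Split [0, size) into contiguous (start, end) byte ranges, inclusive ends."""
    step = size // parts
    bounds = [i * step for i in range(parts)] + [size]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(parts)]

def fetch_range(url: str, start: int, end: int) -> bytes:
    # One Range request per stream; the server replies with 206 Partial Content.
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_download(url: str, size: int, streams: int = 8) -> bytes:
    ranges = split_ranges(size, streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        chunks = pool.map(lambda r: fetch_range(url, *r), ranges)
    return b"".join(chunks)

# e.g. parallel_download("https://example.com/big.iso", size=4_000_000_000)
```

BitTorrent goes further by pulling ranges from many peers at once, which is why it can saturate even a very fat link.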

So how does like all this love that you

have for infrastructure

for networks for, you know,

fast things translate to Namespace? We

first and foremost, build

something for ourselves. Well,

the origin of Namespace Labs, and the

name is that we were going

to build an infrastructure

company that focuses on software defined

storage, because it was kind

of a big thing that both me

and another person that is not here, HDR,

have a passion for. But

as we were building it out,

we kind of found a few

challenges along the way.

And then we moved over to build

an application platform. And as we were

doing that, we wanted to

run a lot of tests in parallel

very quickly, because we didn't want to

wait minutes for an EKS

cluster to be created, or even

a GKE cluster, to be created.

I put together

something that kind of cut through all

the layers and just focus on the

essential to start a

Kubernetes cluster really, really

quickly, because we wanted to run many of

them in full isolation

to test this

application platform. So

that was the genesis. And it was

really for us, because we are developers

ourselves, actually,

majority of the company

is engineers, and we have an appreciation

for infrastructure

that works well, that is

understandable, and it's that it's fast.

So that's something that we try to

project into the products

that we build. And many of the

things that we do at Namespace, one of

our product principles

is fast is a feature. So we try to spend

quite a bit of energy on

making things as fast as possible.

Yeah. When you say Kubernetes clusters

that spin up fast, or very fast, what

does that mean to you, very fast?

It had to be seconds, like that's what

made sense. But it wasn't

just a bullish "we need this

to be seconds," but it came from

the source of, can we know

how things work? So we know

how long Linux takes to boot up like the

kernel to start, we know

how long it takes to scan

devices, we know how long it takes to

mount a file system, we know how long it

takes to start a process

we know, you know, if you kind of add all

of those things up, you get to a point

where you start

questioning, why does it take minutes?

And so there's kind of

inefficiencies in the system.

And even today, like, the Kubernetes API

server is fantastic, but

it has a few things built in

that are not kind of level triggered. So

there's kind of waiting

periods, even we wanted to make

it faster. But to make it even

faster, we would

have to go and change the

implementation. So it really came down to

how fast we think this

should be. And I did some

back-of-the-envelope calculations.

And I said it shouldn't

take more than 10 seconds to

start a single node Kubernetes cluster.

So that was the starting point.
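That kind of back-of-the-envelope budget can be written down directly. The component timings below are illustrative assumptions, not measurements; the point is that when you sum the parts you land in seconds, which is what makes minutes of wall-clock time look like pure overhead:

```python
# Illustrative boot budget for a single-node Kubernetes cluster in a fresh VM.
# Every number here is an assumed order of magnitude, not a benchmark.
budget_ms = {
    "kernel boot (minimal VM)": 300,
    "device scan + virtio setup": 100,
    "mount root filesystem": 50,
    "start containerd": 400,
    "start kubelet + control plane": 3000,
    "network programming (IP, routes)": 150,
}

total_ms = sum(budget_ms.values())
for step, ms in budget_ms.items():
    print(f"{step:35s} {ms:6d} ms")
print(f"{'total':35s} {total_ms:6d} ms")

# If the parts sum to a few seconds, minutes of observed latency points at
# inefficiencies (polling loops, fixed waits) rather than inherent cost.
assert total_ms < 10_000  # the 10-second target mentioned above
```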

Creating Kubernetes clusters

from scratch, fully isolated, so not, you

know, a pod running in

another Kubernetes cluster,

but rather like a virtual machine where

you have access to the

kernel, your own kernel,

you have your own rootfs, so you

can decide what gets

packaged into it. And then that

allowed us to start running more

tests faster, but the

main thing was the fan

out, we wanted to run many in parallel.

Yeah. So just to have a better

understanding of the scale

that we're talking about, and I'm just

looking for a magnitude,

are we talking thousands of

Kubernetes clusters? Are we talking tens

of thousands, hundreds of thousands?

Like, how much are we talking

about in like, what period of time as

well, just for listeners to have an

appreciation of the scale

that this operates at?

Thinking about Namespace, we do many

millions of runs over

a short period of time. So that's kind of

the scale that we're operating.

And every single

instance is fully unique. So it's

completely new virtual machine.

Everything gets started from

from scratch, the network gets programmed

dynamically for that

instance, there's like

everything is from scratch. But when we

started, like our target

was to run like 100 Kubernetes

clusters in parallel. So you can see

where the humbling starts, and

now we have customers that

start a high magnitude of

concurrent jobs. And

that's even

one of our biggest challenges nowadays is

supporting that type

of performance. So very

low latency creation at a very high

concurrency, we have tenants. So that's

kind of the unit in

our system, that create thousands of

jobs in an extremely small

period of time. And those

run over many, many, many machines. But

even today, if you go to a

Kubernetes cluster, and you

start 1,000 pods, if you define some

kind of quick equivalence with

another system, you'll see how long it

takes for those pods to be created.

Because first, you need

to commit the state, then

the scheduler needs to

decide on which machine they're

going to run, then the machine needs, you

need to have IPAM. So you

need to have like an IP address

assigned to the pod. So there's kind of

many things that need to

happen. And as you scale out

the concurrency, you hit serialization

limits, because some of

these they need to be, you need

to have like a consistent view of the

universe to be able to make

a decision, like you cannot

assign the same IP address to two pods.

So you need to have some sort of

serialization. Yeah.
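The IP-assignment example can be made concrete with a toy allocator, where a lock is the serialization point that guarantees no two pods ever receive the same address. All names here are hypothetical; real IPAM implementations use transactional stores rather than an in-process lock, but the constraint is the same.

```python
# Toy IPAM: concurrent allocations must serialize on a consistent view
# of the free pool, otherwise two pods could get the same IP.
import ipaddress
import threading

class Ipam:
    def __init__(self, cidr: str):
        self._free = list(ipaddress.ip_network(cidr).hosts())
        self._lock = threading.Lock()  # the serialization point

    def allocate(self) -> str:
        with self._lock:               # one allocation at a time
            return str(self._free.pop(0))

ipam = Ipam("10.0.0.0/28")

# Simulate a burst of pods being scheduled concurrently.
results = []
def schedule_pod():
    results.append(ipam.allocate())

threads = [threading.Thread(target=schedule_pod) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(results) == len(set(results))  # no duplicate IPs handed out
```

Remove the lock and run enough concurrent allocations, and two threads can pop the same head of the list; that is exactly the serialization limit that caps fan-out as concurrency grows.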

And so that's kind of the types of

challenges that we're tackling today.

This is

a little bit of an aside,

but when you have a natural

partitioning scheme, like two customers,

for example, scaling

across customers is a little bit easier,

because you can

partition your infrastructure.

But when you go inside one customer, then

things start to become a

little bit more challenging.

And those are the types of scaling

challenges that we have today.
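The partitioning point can be illustrated with a toy tenant-to-cell mapping: hashing tenants onto independent cells makes scaling across customers easy, because each cell operates on its own, but all of one big tenant's load still lands on a single cell, which is the harder problem described here. Cell names and tenant IDs are made up.

```python
# Toy partitioning scheme: stable-hash each tenant onto one of N cells.
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]

def cell_for_tenant(tenant_id: str) -> str:
    # Stable hash so a tenant always maps to the same cell.
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]

# A burst: one huge tenant plus a small one.
jobs = [("acme", i) for i in range(1000)] + [("globex", i) for i in range(10)]

load: dict[str, int] = {}
for tenant, _ in jobs:
    cell = cell_for_tenant(tenant)
    load[cell] = load.get(cell, 0) + 1

# All 1,000 "acme" jobs pile onto a single cell: scaling *within* one
# tenant cannot be solved by the cross-tenant partitioning alone.
print(load)
```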

Yeah, especially the big

customers that you mentioned that

start a lot of jobs at once. And a job,

what does a job mean?

Like, what does it translate to

in infrastructure terms? Are we talking

containers, virtual machines, how many

CPUs, how much memory,

like, what does the job look like? The

unit of compute in our

world is an instance. But that

instance is a combination of a virtual

machine. So you get full access to the

kernel and everything

in that virtual machine. But it's an

environment that is designed to run

containers. So it's not

you don't get an Ubuntu virtual machine,

and then you go and you

know, deploy systemd units,

that's that's not how we think about the

problem. We approach it

from we use containers as a

distribution mechanism. So you define

your application,

whatever you want to run, you

encapsulate that in a container because

it has all of the software

that you need. It also tells

us how to start it, has a few other

properties. And we place that

container or multiple containers

in a virtual machine that can use

an arbitrary set of

resources. So you can decide

whether it uses two or 16 CPUs, or

whether it uses, you know, two or 256

gigs of RAM. So you have like

full flexibility on that. And then also

from a network

perspective, like if you want to

interact with whatever is running in that

in that instance, you get a few

management properties out

of the box, like you can SSH in, you

don't need to configure anything,

but if you want to access

the service that you have, then you also

have primitives for that, too,

to kind of program your ingress.

We say jobs, because we kind

of approach the problem in a layered way.

We think of the compute platform,

which is a little bit more

generic, as one thing, and then

applications built on top as

something separate. And

a lot of our customers,

they use Namespace to run jobs. And so

those jobs are usually something that

starts, has a purpose,

wants to go really fast, that's usually

the case. And then it ends.

And it could be a GitHub job,

it could be a Buildkite job, it could

be a GitLab job, it could be a CircleCI

job. But it can also

be your custom job, like one that you want to

run a system test. So for

example, we have customers that

deploy system tests on instances.

And they can rely

on something that scales out without

being constrained by

whatever resources that they

have available in the job where they

started. I think

how people deal with adversity

is very telling. When something fails,

especially when it fails,

how do you handle that tells

everything about you at many levels, as

an individual, as a

team, as a company.

And the reason why I say that is because I know

that you had the major outage

this year. And it was one of

the things that you don't expect will

happen. You prepare for it. And when it

happens, you're like,

wow, I'm so glad we had some

preparations. But it's very difficult to

simulate that. It's very

difficult to fire drill that.

It's really, really hard. So can

you tell us more about what

happened? And how did you respond?

We've been running Namespace now

for some time, so for

close to two years. And we've had our

challenges along the way. But

nothing as big as this more

recent outage. A couple interesting

things there. Like, we had two issues

that happened at the same time.

And I'm lucky that our team is

experienced, and we've operated and kind

of supported and built

large scale systems over

our years before

Namespace. So that gives us a little

bit of preparation, like

how can things fail? That's

very often when we approach

building something. It's not just a

functionality that it has, but

something that is part of our

conversation is what are the failure

modes? What if you have an

application, it's stateless,

but it pushes some state into some

database? Well, what happens

if you don't have access to that

database? What happens if you have

multiple requests going

concurrently? And you compromise

on your serializability of your

transactions? Like, how does your

application react to

potential inconsistent states that you

had to do for other reasons?

So we try to incorporate as

much as possible, like a failure mode

into how we approach

features. This big outage that we had,

it was kind of a combination of two

things. Namespace, when

we started, and we used

exclusively hardware provided by others.

So actually, we started

with bare metal in AWS,

and then we switched over to Equinix

Metal, or Packet. And then

we kind of worked with other

providers over time. And fairly early on,

it became obvious to us

that in order to offer

a great product that had an emphasis on

performance, we had to

have a lot more control

over the hardware, not just individual

servers, but also the

layout of the rack. So how much

network capacity there is? Do we know

that one compute node is

next to another compute node?

Is it in the same switch or not? So all

of those things started to

play a role in how we approached

our development. And we couldn't find a good

mix that would give us both the global

reach that we needed,

because we have some customers that want

to run workloads in North

America. We have customers that

run workloads in Europe. And we realized,

well, we have to do it

ourselves. So, Namespace deploys

its own hardware, and software stack on

top of that hardware. So

that means we decide everything

from CPU, RAM, how much storage, how much

networking, what's

the layout of the rack,

how do our racks kind of the spine leaf

setup, how that is, so all

of that is done internally.

And we set ourselves in a journey to kind

of move completely to our

own hardware. And we've been on

a catch up for some time.

We had

in October, a major expansion of one of

our sites coming in, where the

distributor that we work with,

they made a mistake in their order, and

they ordered the wrong

DIMMs for those servers.

And it's a lot of DIMMs. It's not just,

you know, 20 DIMMs or 30 DIMMs that you

can go to a shop and

get. It's actually so many that they had

to go and order directly

from the source. And that added

three weeks more to that delivery. We

were counting on that

hardware, because we knew that

we were already running quite hot. So

quite hot, as in like our utilization is

high. So part of the

reason why we were okay with that is

because we have tools that

allow us to manage utilization

across sites. We can, we can run in

continuous optimizations

where we try to maintain

each site kind of hot enough, but not

more than that. So we can

kind of move things around.

But globally, because of that missed

delivery, we were running quite hot. At

the same time, one of our

existing deployments in a company that

offers kind of bare metal

that we used, they started having

an issue in their network product, which

we use to connect multiple servers

together into a single

layer two segment, where it led to

sporadic packet loss. And at first, while

the internet is built on

sporadic packet loss, so things just kind

of work. But as that became

worse over time, it was so bad

that it had a real impact into our

customers. And we interacted with that

vendor and for various

reasons, they kind of acknowledged, but

they didn't react quickly enough to the

problem. So we decided

that that wasn't acceptable, the level of

service that we're offering to our

customers, the fact that

we were a source of flakes, because of

that kind of random packet

loss, it was not acceptable. So

we strategized and we made a decision on

changing our network setup

so that we wouldn't depend on

that particular feature. That meant

though, that we had a dip in our

capacity, because we had to

redeploy that part of our infrastructure.

That has to do with the

fact that we run immutable infrastructure,

or try to be very immutable. So as

machines move from one setup to another

setup, they need to be

reset up to get new keys. So there's kind

of something else that

kind of plays a role there.

And that took some time, we had practiced

that, but it took longer

than we anticipated. One of

the challenges was we rely a lot on state

that lives on individual

machines and not on the

networks to enable fast performance,

bootups, etc. And distributing that

state, because we had a much

bigger fleet versus what we had done

before, for that particular region took

longer than we expected.

So it's kind of distributing all of the

state across all machines, it

highlighted a few bottlenecks

that we had. And that build up took some

time. So we were running

really, really hot for some time,

we had part of the team just trying to

support our customers,

making decisions on, okay, we're

going to move this customer to this part,

because now it's actually their peak

time. And we want to

make sure that they get as good of

experience as possible. So

there was kind of part of the team

that was just trying to offer as good of

support to our customers as possible,

where the other part

of the team was just kind of rebuilding

the region. And we did it,

but it was extremely taxing.

It's primarily because we feel such a

strong commitment to the

services that we offer,

because then we've experienced,

there's something that we

depend on as a developer,

and then it's not working. It's just the

worst, right? I cannot do my job.

So I think emotionally, it was extremely

taxing. I look at it a

lot from the human side,

like you're trying to do something that

is, you're trying to do a

great job, you're trying

to provide a great service to our

customers, but then we let them down in

that particular moment.

And we tried to be transparent about it.

We wrote a postmortem.

The things that I mentioned

and more are there. We learned a lot in

that experience. And to

be honest with you, we were

expecting that some customers would come

to us and say that this is unacceptable

and we're moving on.

But not a single customer left due to

that outage. And we

actually got a lot of support,

and I have a big appreciation for our

customers. I met one of our

customers in San Francisco

a few weeks after the outage, and they

said, "Yeah, it was 3

p.m." And we decided, "Okay,

we're going to call it a day now, because

it seems like our jobs are not running.

But we're so happy with the service that

you folks usually provide

to us that it was one day,

and that's okay." But yeah, it felt very bad.

It wasn't a complete outage, right? It was a

degradation, a significant degradation,

but not every customer was

impacted. So this was limited

to one region. That was the blast radius,

and you have multiple

regions. So that's one. The second

one is that not all customers were

impacted the same amount, right? Because

as this was happening,

you're also moving customers off, which

I, you know, that's

something which I missed. And I will

go back to the post-mortem, by the way, I

will add a link to the show notes.

The way you handled something failing, and

something failing in a very

significant way, right? The whole region

going away, or being unusable.

You were able to

be hands-on, you understood how

all the pieces fit together, which means

that you were able to do

something about it, rather than

putting your hands up in the air and

saying, "Hey, it's the provider, we can't

do anything about that."

Think about what happens when you're in

AWS, or GCP, or Azure, or,

you know, one of those big

providers, what can you do? And you say,

"Well, I'm going to move things off it."

You may have so much

stuff to move off that you can't move it

off, not to mention that if

there's an outage, how are you

going to move stuff off? Especially if

the DNS is there, you can't get to the

DNS, you can't update

it. And this happens, you know, for many

companies, and many, many

businesses. And at the end of the

day, we are humans. The internet does go

down, or at least half of

it. I remember when Fastly went

down, they had an outage, or CloudFlare

went down, or Facebook

starts, you know, the BGP routes get

all messed up. Now that's bad. But in all

of this, there's always

something to learn. There's

always something to improve. And the best

approach is to get

better on the other side.

You mentioned something that I think is

really important that there's a

particular company that

we work with, and they played a part: one

of their products wasn't

working to spec. But from my

perspective, that's on us. We decide who

are the companies that we

work with. And I don't even want

to throw them under the bus. Like, I

think, perhaps other

customers didn't have the same

choice. Perhaps it was the way that we

were using their

infrastructure that led to that. And it's

really on us. And obviously, when we work

with someone, and they

can provide us great support

that helps us get the resolution faster.

Great. But in that case,

I felt, and the whole

team felt, a commitment to our

customers. And it doesn't

matter if it's, if it was like a

delayed delivery, or if it was a

particular upstream, or if it's a

particular provider,

that's really on us. And we felt

that, okay, we need to do something,

we're not just going to say,

you know, this is not usable. We

need to do something to get back to

service to our customers.

I think a lot of props should go to the

team. But I think our

ability to handle some of these

situations is: if you

find yourself in a situation and it's

new, and you're not prepared,

it's going to be much harder.

So the more you prepare both, hey, this

disaster scenario is

possible. And maybe, just maybe,

you don't even run an exercise, maybe you

just talk about, here's

what we would do so that you

just have a shared understanding of what

are the tools that would be

available to us if something

like that happens. I think that's already

the first level of

preparation. And then it goes

to the architecture as well. We try to

present a global service to

you so that you don't have to

think about regions and capacity and all

of that. But behind the

scenes, there is partitioning,

both for performance reasons, but also

for reliability reasons. And that

design principle also allowed us to

continue to serve our customers, even at

a very degraded state

while recovery was going on. Because from

their perspective, things

got slower, because we just

didn't have enough capacity for all

of their jobs. So how does

Namespace come together for a

user like myself, or for regular users,

I'm very keen to basically

see what it means when we use

Namespace on the command line. How does

it compare with the local

stuff that you may have running

locally? Because that's also like,

sure, run things locally. But not

everyone has 25 gigabits

at home. But even then, when you do, you

want the

resilience. So I'm wondering if

there's something that we can screen

share. There's something we can look at

just to see how this

comes together in practice. Yeah, I'm

happy to show you a couple things. I

would preface though with

we're very pragmatic. There are certain

parts of your developer workflow or

certain things that you want to do

where running

them locally will always

be kind of the right thing.

We really like to think about what's the

right tool for a particular

job. Where we try to excel

is scale out. So you can get things

running really well in your

machine. But now you want to

run 1000 of them. And we could try to

find 1000 machines to run them at home.

But that's

probably not the best, right. So we

try to apply things

where we can provide a

non-trivial amount of value, where it kind

of makes sense to

move over from whatever else that

you're doing, whether it's local

development or something else.

Then with Namespace,

there's different ways to approach the

product. So we're both an

infrastructure provider. So that's

where the nsc CLI comes in. But

we're also a service

provider. And I would say actually,

majority of our customers, they use our

prepackaged solutions.

So they, they want to do

Docker builds, they want their Docker

builds to be as fast as

possible. So we have kind of a

prepackaged Docker build product. They

want to run a Kubernetes

cluster really fast or 100 or 1000

or 10,000 of them, we have a prepackaged

product for that. They want

their CI runs to go really

fast, or in a very cost effective way,

actually, we start hearing a

lot more around kind of cost

management. So we have products for that

as well. We'll focus today a

bit more on the infrastructure.

So kind of under the covers. But if

you want your CI

to go fast, you don't actually

have to run nsc CLI, like there's

products that make that super

easy for you.

I was thinking of starting at the origin.

So, things started in kind of

building this application

framework. And this application

framework, we leaned on something

akin to what we had built

at Google, which is a platform called

Boq, which tackles how

you write services, how do

services talk with each other? How do you

build services? How do you test

services? How do you deploy

services? How do you observe services in production?

To see how Namespace tests Foundation,

the open source application platform

inspired by Google's Boq,

find the YouTube video

link in the show notes.

After Hugo's demo, we look into how a

remote Docker build can

be faster than a local one.

That is a separate YouTube video,

link in the show notes.

OK, let's start wrapping this episode up.

We see more and more use cases around

kind of complex scenarios with

previews. This is an area that has also

been a pain point for us.

And we want to do better.

Another thing is instances right now they're fully isolated.

So two instances, they don't share any networking.

But we...

We have a POC internally that uses Tailscale

where you can connect multiple

instances. But we're also

thinking of just adding a tagged mode where you tag an

instance with kind of a network. And then

instances that are tagged the same in the

same network, they can

communicate with each

other. So you could have like a front end

that calls a back end.
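The tagged mode Hugo describes could be modeled roughly like this: instances that carry the same network tag can reach each other, while everything else keeps today's full isolation. A minimal sketch in Python, with all names hypothetical (this is not Namespace's API, just an illustration of the idea):

```python
# Hypothetical model of the tagged networking mode described above:
# instances tagged with the same network name can talk to each other;
# untagged or differently tagged instances remain fully isolated.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    name: str
    network_tag: Optional[str] = None  # None = fully isolated (today's default)

def can_communicate(a: Instance, b: Instance) -> bool:
    """Two instances can talk only if both carry the same network tag."""
    return a.network_tag is not None and a.network_tag == b.network_tag

frontend = Instance("frontend", network_tag="preview-42")
backend = Instance("backend", network_tag="preview-42")
builder = Instance("builder")  # untagged: stays isolated

print(can_communicate(frontend, backend))  # True: same tag, frontend can call backend
print(can_communicate(frontend, builder))  # False: builder keeps full isolation
```

This mirrors the front end / back end example: tag both with the same network name and they can communicate, while anything untagged keeps the current fully isolated behavior.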

Our goal is not to kind of cover all of the possible,

you know, compute use

cases, but just things that are

helpful and ideally kind of easy to use

to achieve what you want to achieve.

Typically creating a preview, you can go

all the way to a PaaS,

like go to a solution that

just packages everything and then you

have very little flexibility, or you can go, "Okay,

I need to do everything from scratch."

And we try to be somewhere in the middle

where you have kind of

building blocks that are

helpful but you still

can make it your own.

You can still decide: what goes inside of my container?

Is it multiple containers?

Whether I want authentication or not.

So all of that, it's kind

of more our mental model,

kind of our design principle of being

somewhere in the middle with

not fully packaged but also not

completely done from scratch.

As we prepare to wrap up,

one last thought,

one last takeaway from our conversation

for people who stuck with us to the end.

What would you like them to take away

from our conversation?

I was asked recently,

how does one become good at something?

And I've worked with so many engineers

that are extremely good.

And I've been looking for patterns,

like what are the things that are common

across these engineers?

And I find that it's usually some kind of

unrelenting curiosity

that really propels people beyond just

being good to being excellent.

And I think that kind of comes back to

when we approach how

we build our products

is with that same level

of unrelenting curiosity

and willingness to break

through and change things

that may help us build a better product.

And I think having that

courage has been helpful for us,

but when we bring people in, we try to

instill that same spirit of

just go deep, read the

code, try different things,

see how it works, that just really helps

propel us to do better.

Well, on that note, thank you very much

for joining us today, Hugo.

I look forward to all the improvements

you'll be driving in Namespace.

I think you're on to something here.

I really like the speed, I really like

the simplicity in many ways,

and I know that behind it, there's a lot

of complexity that you need to handle

to make things this simple and this fast.

Thank you very much.

Thank you.

And I look forward to the next one.

It was a pleasure to be here. Thank you.
