CppCon 2017: John Lakos “Local (‘Arena’) Memory Allocators (part 1 of 2)”

– And I guess we're gonna get started. Hello. My name is John Lakos. I have been at Bloomberg
for now about my 16th year. Prior to that, I worked at
Bear Stearns, which is no more, and that is where this
notion of allocators became a real thing. And prior to that, I
worked at Mentor Graphics, where I used memory
allocation for a long time and actually talked
about it in my 1996 book. So we've been dealing with allocators, we meaning me, for a very long time. And this is not new to me. But some of you may feel that this is new. It's not new. Also, people in the video games area know this stuff better than I do, and they use it all the time. And it is not magic, and it is not syntax or any of that stuff. It is absolutely essential in order for them to do their jobs. And the way I came on it, it was absolutely essential
in order for me to do my job.

So if anybody doubts that this is real, by the end of this talk, there should be no question
at all that it's real. The only thing that
you should be left with is to make the decision,
is it worth the effort? The good news is, it
actually is worth the effort, and when we get to C++20 and 23, the effort will go to zero. So that's good news. And then you'll start to worry about, "Well, what about the performance cost?" Ha.

Don't even get me started. Now, there's a lot of material in here. Normally when I give a talk, I have like five, six,
seven, eight hundred, nine hundred, maybe even
almost 1,000 slides. You'll be pleased to know I
don't have more than 500 slides. The bad news is they're animated, which makes them like 5,000 slides. (audience laughs) We have two hours, though. Hopefully I'll finish early. I don't mind if you stop me
to understand the material.

There's gonna be enough time
at the end of the second talk for you to grill me on any
concerns that you have. I look forward to it, I really do. So any questions that you have
about what I'm talking about, please do interrupt, and
I will try to remember, Mr. Sign Man, to repeat the question because I need to do that. Does anybody have any pre questions or anything about the rules, which is, if you don't understand
something, you stop me, but if you wanna heckle
me, wait 'till the end? Okay, good. All right. The first slide is there because Bloomberg has a standard slide, but if they thought, or anybody thought or if it were even possible to go in and change all my animated
slides to these Bloomberg slides, that was not happening. So somebody just put it back. This is the Bloomberg slide. Everybody. (inhales deeply) Okay, that's done. And this is John's slide.

We're gonna use John's slides from now on. All right, this is the Copyright. It's there because it has to be there. This is what we're talking about. Are they important? Are memory allocators
really worth the trouble? What situations can we use them in? How can we apply them? What's the performance impact? By the way, performance
is extremely important when we talk about allocators, but surprise, it's only one of three areas where allocators are very, very important.

So we're gonna talk
about performance today, but if I were gonna argue
the value of allocators, I would have two other complete talks about why they're important. But today is just performance. By the way, at the end of this, you won't care about the other two. So here's what we're gonna do. I'm gonna talk a little
bit about background, because this is, if anybody
saw Bjarne Stroustrup's talk where you need to know about the layers, you can't deal with the
highest level abstraction, you have to go down to the libraries and perhaps even to the operating system and perhaps even to the
hardware, guess what.

This talk is gonna have a little hardware in it, unfortunately, so we have to do that. But we shouldn't be
embarrassed about that, right? We should be proud of that. We're gonna look at the
hardware a little bit. Even the hardware that I'm gonna
show you is an abstraction, because the idea, it's the gist of it. It doesn't matter. If we get too precise, then we have to look at
different processors. We don't want to look
at different processors. We want the gist, because it
is a very rich, robust benefit that we're gonna get,
so we're gonna keep it above individual architectures. We're gonna talk about
a general architecture, although we will run benchmarks
on a specific architecture. We have run them and
they're published on GitHub for many different platforms,
different architectures, different compilers. The results are more or less the same.

You couldn't possibly care. Your eyes would glaze over if I showed you all of the data we have, so I'm not going to. Let me show you some stuff. So, why do we like C++? Well, a couple of things. It enables us to fine tune at a low level, and it can deliver a very high runtime
performance when needed. That's why we use it, right? Otherwise, we'd use something else. Why should we care
about memory allocators? Well, they enable us to fine tune at a low level when needed, and they can help improve
runtime performance.

So it seems like allocators might be a good match for C++, right? Seems like it, right? We want high performance. But no, we can't be
bothered with the platform, so just forget about allocators. Doesn't work that way. It really doesn't work that way 'cause there are people who will not use the standard library without
allocators because they can't. I was in such a position a long time ago, and I didn't use the standard library. What are the benefits? Not all memory is alike. Different kinds of memory. That's one of the benefits.

You can deal with
different kinds of memory. You don't have to say all memory is alike. Another one is we can do testing. We can do debugging. We can do measuring. We can do profiling that
we couldn't otherwise do. If you talk to somebody
who runs a large system, the profiling is the most important thing, not the performance, but the profiling, because when something goes bad and you need to figure out what's wrong, this is how you do it.

When memory is getting sucked
in somewhere, what do you do? Really, honestly, this is how you do it. Oh, yes, runtime performance. That's very important to some people. Not everybody, maybe not
even half the people. And given that allocators
today require some effort, and we'll see that it's
not as bad as people think, but that's not the purpose of the talk. Purpose of the talk is simply to explain that they actually are worthwhile. The next step is to figure
out how to deliver them at the least possible pain point. C++11 is the most possible pain point. C++98 is they don't work. Bear Stearns, circa 1997. I invented, I, taking credit, invented what is now the C++17 memory model. Make no mistake, it doesn't
look like what I did, because what I did was
so obvious and simple that anybody could understand it, and what we have in C++17 is a vast improvement over C++11, which was a necessary painful step because people had fear,
uncertainty, and doubt about polymorphic memory resources.

No, it had to be part of the type, 'cause that's all we know. You know, fear of the unknown. By the way, fear, uncertainty,
and doubt, that's FUD. There is a lot of FUD going around. We need to debunk the FUD. That's what the fun questions are gonna be at the end of the talk. You're gonna say, "But it
isn't it (mumbles) like this?" Nope. All right, so we're gonna see
that, but it'll become clear. So back then, we had
coalescing allocators, and so somebody came to me and said, "I have a model, and
I build up this model. "And I use it. "it's great. "Then, when I go to destroy it, "it takes 10 seconds
to reclaim the memory." Can you imagine 10 seconds to reclaim the memory for a model? So it turns out that because
we had coalescing allocators, the benchmarks were all based on how fast we could allocate the memory, but they didn't care because
you'd just leak the memory and it goes away, and who cares? So to actually reclaim the
memory was the big bottleneck.

I gave them a local allocator
using my model from 1997, and instead of taking 10 seconds, you couldn't see it on Purify
because it didn't exist. That's the kind of improvements
we're talking about, from 10 seconds to zero. Another one, Bloomberg. Whoops. Did I go a little fast here? Hold on. 2002, Bloomberg. We have a way of doing swapping to disk. The way we do it is we have something called finch memory. And so we have memory mapped IO. We store things that we wanna
save in a process to disk. And because all of our
processes are the same, we can read them into any other
like process on the machine, and we're reading in binary, but it can just come back the way it was. You can't do that without allocators, and you can't use the standard library unless you can provide
a finch memory allocator to take and put your memory, right? So we need to do that.

That's the placement aspect. And then, in 2006, people realized that you can use
allocators on the front end to make the user interface zippier. That's something people observed. And people who were not at all happy with the way I did
things at Bloomberg said, "That's the best thing you've ever done." I'm like, "Okay, thank you. "I'll take it." So what are the common arguments against? I'm sure I don't need
to give you arguments, 'cause I'm sure you all have arguments. "Oh, allocators, not worth it." Well, they require more up front design. Well, software engineering
is about up front design. There should be a payoff
for it, of course. Complicates the user interface. Really? How much? How are you designing it? These are all fair concerns.

May degrade performance. If I try to use an allocator here but I don't really need it,
it's gonna slow me down. I'm sure you've heard that. Suppose we don't need an allocator. We don't need a special egg. Why should I worry about it? And that's most of the time, right? 'Cause when you're using C++, performance isn't typically important. It's only sometimes. So why should I worry about it? Maybe I choose the wrong allocator and blow the whole thing up.

Do we have to actually
know how to program, or can anybody write C++? I don't know. Anyway, so these are all valid concerns, and we can deal with them only by well supported facts
and careful measurement. Does that sound reasonable, as opposed to fear,
uncertainty, and doubt? That ends here. There is no doubt, none. You're going to be shocked
at how little doubt there is. We're gonna talk about
some background material, like main memory. What does main memory look like? Well, we have the CPU
and we have main memory.

Everybody knows this. This is three levels down the
stack where we have hardware. I want you to appreciate the amount of time I
put into drawing that. You see the two gradients there? That's the wear from that wheel turning. Now, I'm gonna start it up. Hold on. First, we're gonna do some memory. We're gonna put some memory there. Did you see it turn? (audience laughs) And then we're gonna put some more memory. Oh, it turned again. I hope you're enjoying this, because I got very frustrated with how much time it took to do this. (audience laughs) So I kept it up for just long enough for you to know that I
can do it, and that's it. (audience laughs) (audience applauds) Thank you. Applause are welcome. Then, I'm going to do
something that is unthinkable. I'm gonna add a cache. That will never happen
again, that thing over there. We're done with that. (audience laughs) That's staying fixed. And so now we're gonna add something, and you'll notice that what
happens is we get a cache line.

And everybody has heard
the term cache line. For those few of you that have
never heard of a cache line, it's a chunk of memory, contiguous memory, that gets pulled into
the cache all at once. And so the next time you ask for something on the cache
line, you already have that. So watch what happens. That's one cache line. Here's something from another place. That's another cache line. The cache can take something from anywhere in memory and put it there, so now you have these two things. But now, when we add
something else up there, it's instantly in the cache, 'cause it was already there. And so on. So we just keep doing this. And eventually, we might
write to the cache like that. And then we have to save it. Now, there are all kinds of
algorithms and everything. Please, let's not worry about it. The idea is that has to be
put back to main memory, and then we'll just get
rid of it from the cache.

And now that space is
available for something else. Look there. And so we get a cache
line, and it goes there, and that's how a cache works. And there are many levels
of cache, and we don't care. Then there are things at a
higher level, like paging. So when you have so many
pages in your real memory, then you have to page out. And that's kinda like a cache, because it goes to a slower memory. So we have all these different levels, from registers to L1 to L2
to L3 to virtual memory. All of that, right? What you need to know is, things that are accessed
frequently in some block of time need to be close together in memory. If you can deal with that
concept, that's locality. That's good. Not having locality is bad. Can we all just accept that? Because we should have known
that for the last 50 years. Okay, good. Ah, so now we're here. We have an executable… Did I jump over something? Oh yeah, main memory segments. This thing is going a little funny. All right, anyway, this is what happens.

We load in a program, and it becomes an executable program, and then we have stack memory, which grows down typically, let's say, and dynamic memory, which grows up. And it can be in the other direction. It doesn't matter. But these are the two ways we get memory other than static memory, and static memory, of course, lives in the executable program itself. There are all kinds of
different static memories that are more efficient
and less efficient, but it's not important.

So what's a memory allocator? So, a memory allocator
is something that, well, allows us to control memory. With C language memory allocation
utilities, for example, we have malloc and free. And malloc and free
work on dynamic memory. Now, we have another one
some very smart people who really cared about
performance back in the day decided that this thing
called alloca was important. What that does is it gives you memory on the hot program stack.

And that's really good if
you know what you're doing. If you don't know what you're doing, your program will crash and burn. That works on bits of memory in the stack. How am I going with speed? Everybody with me so far? Okay. I don't wanna lose you. So, this is a first cut
at a memory allocator. A memory allocator organizes
a region of computer memory, dispensing and reclaiming
authorized access to suitable sub-regions on demand. Might not be contiguous regions. That's okay. But your first thought is, it's a block of memory, and
I'm gonna get stuff out of it. All right. So we have general purpose allocators and special purpose allocators. And you can imagine a general purpose allocator
works for everybody, and it does a good job all the time. And a special purpose allocator doesn't work for everybody, but it works for some
people real, real well, and so you say, "I know more than the
general purpose allocator.

"I'm gonna do this." And if you get it wrong,
the whole thing blows up. So if you don't know what you're doing, don't use a special purpose allocator. But if you do, you can beat
the compiler, hands down. Then we have a global allocator
and a local allocator. A global allocator is good for
the duration of the process, and a local allocator
isn't necessarily good for the duration of the process. The global allocator is
accessible from everywhere, and it doesn't need
any reference or memory because it's global. It would just go there. Local allocator, you
have to provide a handle so that it knows what's going on. And the local allocator doesn't
talk about all of memory, but a subset of memory. And it doesn't talk about the entire duration of the process, but only a subset of the
duration of the process, so it's a little time space box. You can use this memory right now, just this memory and just for now. That's a local allocator.

So in C, we have malloc, and in C++, we have new and delete. But those operators are not allocators. They are ways we access allocators, right? They're not allocators. They're syntax. So now we have global and local allocators and general and special
purpose allocators. Look, we have some things in the box. And you notice that malloc
and gree and new and delete are global, general purpose allocators. But they're not really. They're just specs for
general purpose allocators. Now, tcmalloc and jemalloc are not specs. They're actually allocators. They're general purpose allocators. And some might say,
"Well, if we have those, "we don't need local allocators." Some are wrong. Then, we have special purpose allocators, and they could be global.

They could be globally accessible, but that would be pretty dangerous, but not necessarily. If you're in a program
that's single threaded, do you really need an allocator
that does synchronization, or can you just say, "I don't need that, "because I don't have multiple threads. "So why do I need synchronization?" So that might be a reasonable improvement, and it turns out it is. You don't need synchronization. Why on Earth would you pay for it? Okay. Local allocators. Again, if you know that you're about to do
something very intensive, "I'm gonna go over here "and I'm gonna work on this for a while "and then I'm gonna stop," local allocator is your
friend, because all the memory, all the stuff that you're
doing is right there. That means the number of cache lines that are
in play are much smaller, which means the number of
hits you're gonna get in cache is much greater. And it applies the same
to pages and memory and every level of cache
up and down the hierarchy.

It doesn't really matter
why you're getting it. The key is locality is king. So if you're gonna access
something in rapid succession and you can do all of it
in a boxed region, you win. Otherwise, you lose. Does anybody have any questions on this? Going too fast? Locality is good. That's a fact. Okay, good. A memory allocator is a
stateful utility or mechanism that organizes a region
of computer memory, dispensing and reclaiming
authorized access to suitable sub-regions on demand. So here's a local allocator. And the computer knows
how to access that memory.

The allocator itself has some logic in it, but it's tied to memory. Now, some people seem to think
you can copy an allocator, and if you think you
can copy an allocator, I got news for you. No. An allocator is, the important part is that it's a region of memory. You can't copy it because it's physical. It's not a value. It's memory. Right, okay? So we know about value types. They represent values, and I have talks on that out the whatever. But this is hardware, and all we're talking about is a pointer. Every allocator that you ever deal with is a pointer to some region of memory and some logic that controls it.

The logic lives in the object, but the memory lives in
memory, and that's a region. All right. So here's a local allocator, and you notice it has
a default constructor, like any value type, but
it's not a value type, so don't get confused. And what it probably has, it's not a default constructor, actually. It takes the region of memory
that it's gonna operate on, and then the rest of it is the logic to allocate and deallocate. And then, this poor clicker
was doing things too quickly.

To allocate and deallocate, and those are the operations
that we normally have. Now, I know there's alignment and all of these wonderful, good things that we could talk about. We're not gonna talk about them today because they don't fit on slides. So we're just keep it real simple, because we can understand that, and then we can extrapolate to the reality which is just a little bit more complex.

But what's nice about local allocators that's not true of global allocators is we have this third thing. This third thing is called release. So while new and delete don't have any notion of
release, nor should they, a local allocator, you can
say to the local allocator, "You know on that memory
you were managing? "Forget about it. "It's over. "Just give it back to
wherever it came from. "You don't even care
what the objects think. "It's not their call. "I know. "Give it back." When you're in that realm, now you're playing with the big people, and you have to know what you're doing. But if you can save significant time in a video
game by doing it, then why not? Whoops. Okay. Yeah. – [Burke] What is the
purpose of this release? Shouldn't you put that in the structure? – Okay, so the question is,
I will get to this more, what's the purpose of release? So we have an object.

Call it your, what's your name? – [Burke] Burke. – Burke. We have Burke. Burke allocates memory. We have an allocator. Burke takes an allocator construction. When Burke needs memory,
Burke says, "Allocate memory." Now you're full of memory. Now, here's John. "Burke, you're done." (audience laughs) (imitates gunshot) You're gone. I didn't call your destructor. You're just gone. Now, you're probably
saying, "Is that okay?" We'll get to that.

(audience laughs) It doesn't feel good to just be dismissed like that, does it? It's horrible, but it's very quick. You're gone, all of
you, in one shot, boom. (audience laughs) Sorry about that. Good question. So a memory allocator is
the client facing interface for stateful utility or mechanism that organizes a region
of computer memory, dispensing and reclaiming
authorized access to suitable sub-regions on demand. I'm trying to build up an idea. We can supply allocators in multiple ways. We could supply them as
stateful utility functions, like malloc and free. But that doesn't really
support allocator objects. We can't really think of
an allocator as an object. It's a pair of functions. It's not, God forbid, object oriented. Then, we could think of it as a reference wrapper
in a template context where we're gonna pass it
into, as a template parameter, and now the type knows
what allocator it uses.

This is the C++98 model. This is also the C++11 model. What's good about that? Well, the concrete allocator derived type is available to the client's compiler. That might be good. Unfortunately, it forces the client to be a template in order to use it. That could be bad. It affects the type of the object, which means it's not a
convenient vocabulary type, because a thing and a
thing with an allocator aren't the same C++ type. I remember where I was in my house. It's a very small room on the first floor, and I was thinking, and I said, "That's a problem with templates.

"They need an allocator in order, "and then the type changes. "And then what are you gonna do? "And how do you take a function? "Oh, wait, the function
could be a template. "But wait, then everything
has to be a template. "Wait a minute." 2004, that was not cool. It's still not cool. Or we could pass the address
of an abstract base class. What's good about that? We can actually refer to a polymorphic allocator with a handle, and it doesn't effect
the type of the object. Yay. The bad news is,
allocator must be accessed via its virtual function interface. And people say, "Oh, can't afford that." How much does that cost? Does it cost anything? Has anybody ever measured it? – [Man In Audience] Yes.

– Yes, and? – [Man In Audience] 20 cycles. – It takes 20 seconds? – [Man In Audience] 20 cycles. – 20 cycles, on your
machine with your compiler? – [Man In Audience] It's average. – Averaged over? – [Man In Audience] (mumbles) Strictly speaking, it's
like from five to– – Okay, so let me just cut to the chase. So you have platforms. You did this measurement last week? (audience laughs) I wanna point something out. We did our measurements
about a year and a half ago. Some platforms there
is a performance cost, and some there aren't. And the ones where there are haven't caught up to the
ones where there aren't yet, and as soon as C++17 adopted this, there's now pressure
for people to catch up.

I'll explain why that can be done, but don't assume that
there's a performance cost. The other one is the object
now, unlike the C++11 model, whether it needs it or not
is gonna somehow have to tuck away an address. And if we assume that most things don't need their own custom allocator, now we're paying a machine address size, let's say 64 bits, in every
object that allocates memory, not necessarily in the footprint but in every object
that actually allocates, not one that could allocate, but in every object that
actually allocates memory, we've gotta find space
for a machine word size.

Some people will say, "Well,
that's out of the question. "We can't use allocators, "'cause that's gonna slow
the whole thing down. "It's gonna make it unworkable." Have you measured it? So that's the introduction. Does anybody have any questions on this? At least we know some things
we're concerned about. Maybe this is a bad idea. If it were a bad idea, would I really be standing
up here talking to you? Anything else? All right. So now we're gonna understand the problem. So here's the problem. I wanna know if I should be
using an allocator or not. So, should I supply an allocator? Whoops. Darn, this clicker. If the answer is yes, I'm done. I mean no, I'm done. I don't need to supply an allocator. But if I do need to supply an allocator, then should I supply it via a base class or should I bake it into the
type of the class C++11 style as opposed to C++17 style? So, whether I do or not is a syntax issue.

The next question is, which allocator should I use, A, B, or C? Whichever I decide, I have
to ask yet another question. Do I want to bother destroying the individual
objects and sub-objects that are fetched from the allocator, or do I just want to make them go away in one shot in zero time? I have to decide that. So that's where I am. This is our decision tree. And I'm gonna see if I can
learn how to do this better. That's our decision tree. So we're gonna go through this process. The thing is, what's gonna
guide us through this process? So, we have a tool chest of strategies, and the first strategy is we're gonna use the global allocator. The next strategy, or that
strategy has two ways. We can use the global
allocator as a type parameter or we can create a wrapper pointer to what's called the new
delete allocator, right? So I can have an object that all it does is invoke new and delete, right? So those are the two possibilities there. Then, I have three possibilities.

I can use what's called
a monotonic allocator, this is a local allocator,
a multipool allocator, or a combination of a multipool
and a monotonic allocator, so now I've chain them together. So what the third one is is it says from new and delete, or
whatever is supplied to it, I'm gonna create a big block of memory. I've got that.

To that, I am going to
affix a pooling allocator, and it's going to pool from that chunk. Whereas just having the
multipool allocator in theory, every time you needed a new chunk, you would have to go and
get it from whatever, but when you gave it back, it would keep it in its free list. That's not how we do it
because we know better, so we're not gonna do that kind of thing. We actually build into our
multipool some buffering that makes it ridiculously efficient to the point where nothing can compete. But that's to one side. So I'm trying to give
you the slip in the truth while I'm telling you the story so you can understand what's going on. The combination of the
two has applications, and either one individually
has applications. We just need to know when to use what. And then, again, we can
pass it as a type parameter, in which case we're
baking the allocator type into the container type or whatever type needs to allocate memory.

Or we can use an abstract base class and pass it in on construction, and now it doesn't affect the type at all. And then the third one, because it's a local allocator, we can either destroy it
normally or we can wink it out. Winking it out means it had
nothing to say about itself. It's just gone. So, that gives us 14
allocation strategies. And that is ample for this talk, and it's ample for anything
you probably ever wanna do when it comes to performance. But it's not ample for
anything you'd wanna do, because you might wanna do something else, like place memory somewhere or you might wanna
instrument things, whatever. But as far as just performance, you're really in good shape here.

You're gonna have to look really hard to find a reason to do more than this. How are we doing? Okay. So this nice little chart
shows you what we got. We got 14 allocation strategies, and then we have how
we're going to affix it to the container object, let's say. It doesn't have to be a container,
but we'll say container, like a vector or a map or whatever. But it could be anything
that allocates memory. And then we can decide whether we're going to destroy
it normally or wink it out. So here are all the
possibilities, nice little chart, and so let's try to understand it.

So here's the normal
way of doing business. So we have an allocation
strategy, either AS1 or AS2, and the only difference
is whether we bake it in or we use an abstract base class pointer. Take a look at this. Here's an allocator. There are no data members. And this is the standard allocator. You see allocate and deallocate. Now, notice that I put them
right in line right there. That's what they do.

This is not a very complex class. This is the standard allocator. It's also the global default. And we're done. That's all we're talking about. Then the next one is, let's see. I wanna use it, I just say that. And by the way, if I were
to supply the allocator, the underlying machine
code would be the same. So if I don't supply an allocator, we get the default allocator. If I supply the default allocator, we get the default allocator. So these have the same code. We're not gonna make a distinction. That's a type parameter. Remember, the type
parameter changes the type, so STD vector with allocator
is a different beast than just plain old STD vector. With a non-default allocator, I'm sorry. That's the same. Now, the next one is, okay, so, here we have our benchmark. We're gonna create a vector, and we're going to populate it like this. We're gonna do our
benchmark on the vector. And then we're gonna destroy it like this. That's the old way. Now, just to prove that
I know how to do this, I'm gonna use auto.

And I can go back and use auto again. The purpose of this talk, of course, is not to talk about C++11, 14, or 17, but really to talk
about memory allocation, so I'm gonna try to keep it very simple for those people that
might not be up to speed on all of the things. It's not important. Next, we have AS2,
which is the same thing, but now I'm going to have it
be an abstract base class, so this is the protocol. Here's my allocate and deallocate,
pure virtual functions. And you can see, this is
a concrete derived type. What does it do? Now, I want you to look at this. Notice that I have a
bug here in the slide. That's on purpose. The bug is what? What's the bug here? That's the bug. I put the inline there
because I'm really serious that this is a derived class that has inline virtual functions.

The bug is I don't need inline because it's gonna be inline anyway because you can see
that the code is inline. I am not kidding. I have an abstract base class. I have a derived class. The derived class has
virtual inline functions. Does anybody have a problem with that? – [Man In Audience] (mumbles) compilers– – What? – [Man In Audience] Some compilers, if you have pedantic warnings enabled– – With the inline? – [Man In Audience] I think so.

– The inline won't compile. Repeat the question. Thank you. So the question is, if
I have the inline here, will it create a warning? No, it'll create an error. It's not syntactically valid. I put it there to remind
you that I'm not joking. It is inline. That's important. Everybody understands, these
are deliberately inline, not just slideware. They're inline for real. There's a reason for that,
because we want the compiler to be able to take advantage of it, and I am not talking about
full program optimization. I'm talking about what
the compiler can see today in your typical compiler scenario. – [Man in Audience] So it doesn't compile? – This doesn't compile because
the word inline is there. So when it doesn't compile,
you delete the word inline and then it will compile. (audience laughs) Okay, see this word inline? You see how right here
there's an inline definition? Pick one.

– [Man in Audience] Okay. – Okay, all right. Okay. So now, this is what we do in this world. We create an allocator, and then we pass in the
address of the allocator to the PMR vector, 'cause
that's how it works in the polymorphic version. That's the C++17 version. So instead of baking it into the type, you pass in the address
as an optional argument, and life is good. All right. So I'm claiming the virtual functions can sometimes be bound to compile type, and the really cool thing is that in every case where that's
important, they can. And if a compiler doesn't
do it a year and a half ago, it probably does now, and
if it doesn't do it now, as soon as C++17 is out and about and the game people get
ahold of them, they will. For example, GCC does but Clang doesn't, didn't, as of a year and a half ago. So, Clang, if you're not doing it already, you have some work to do. All right.

So we have normal destruction, because what else can we have? It's not a local allocator. So what does normal destruction look like? We create the thing. We go through and we build it up. You see how we're doing that. We do our benchmark. And we destroy it, and
this is normal destruction, 'cause that's all we can
do with a global allocator. We can't do that winking
stuff with a global allocator because it's got all the memory that if we winked it we'd have nothing. Might as well end the program. Next … I'm going slowly. Is this too slow? Do I need to pick it up a little bit? I just wanna know.

A little bit faster? Okay, I see somebody
say a little bit faster. Anybody think that's gonna cause, are we okay if I go a little faster? – [Man in Audience] Yeah. – All right. I just don't wanna lose you. It's a little bit of stuff here. All right, monotonic
allocators, what are they? Well, they're this beast. A monotonic allocator is
intended to be used, typically, with a buffer on the program stack. Here we have an example. I'm calling it
bdlma::BufferedSequentialAllocator This is what we have.

This was pre standardization of 17. We've had this forever. This is a really useful thing. We create a buffer on the stack of 1,024. There it is. And then, this is the buffer. Now, people know about
natural alignment maybe. So if I wanna put a char
in here, what happens? It goes there. Now I wanna put another char. Where does it go? It goes there. Then, another char. Where does it go? Goes there. Now I wanna put a short,
and it goes there. Now I wanna put an int, and it goes there. Do you see? It has to be aligned on its address that's divisible by its size. – [Man in Audience] Sorry. Doesn't the standard require alignment– – [John] I can't hear you. Louder. – [Man in Audience] Doesn't
the standard require alignment on the largest
possible, size which is (mumbles) – Okay, you asked a question, does the standard require alignment? Maybe you're talking about allocators, but I'm talking about a
buffer that's raw memory, and I can put it wherever I want it.

So what's gonna happen
in this particular thing, it's gonna do natural
alignment 'cause I said so. And it works. That's what this is gonna do. You could do it the
way you just described, which is to waste a lot of space, but I'm not gonna do
that, 'cause it's just me. So anyway, so then you
can put the short here. Then you can put the char here. But if I get another
int, it has to go here. I'm giving you an idea of what the most efficient thing is. You can back it off to your satisfaction, whatever you wanna do. That's the most efficient. We do things that work. Now, note that if you try
to deallocate something, it's a no op. Here we have, we can do
it as a type parameter. We know what that means. We can do it as normal destruction, so what does that look like? Normal destruction, I'm saying
we have this wrapper thing, it's this thing that holds
a pointer to the thing. It's just a way to build it into the type.

It's just really, it's an
object that's a pointer. And then we have this
buffered sequential allocator. It's a monotonic allocator. We can supply stuff on the stack, we can do whatever we want, but right now, that's just gonna do, it's gonna not allocate memory at all. It's not gonna point to a buffer. It's gonna go to that dynamic thing because of the way I did it. But we could also say, "Here,
please use this to start." So we have some flexibility there, and we're gonna go through,
and we're gonna allocate, and we're gonna do the new separately. I'm doing them separately 'cause I want you to
see the two operations. First thing we do is
we allocate the memory. Second thing is we construct into the memory we just allocated. Then we come down here
to do normal destruction, and you notice the first
thing we do is destroy it. We destroy it, and then we deallocate it.

So allocate, construct,
deallocate, destroy. Right? That's the normal pattern. That's the normal way. Then there's the magically winked way. So this time, when A goes out of scope, we just destroy everything, and then when A goes out
scope, the memory is reclaimed, so we don't have to do that. But the crazy thing is that we
don't need to do any of that. It's legal C++ to just
let those things go away. And you say, "No, it's not." And I'm gonna say, "Yes, it is." And then you're gonna go look it up and you're gonna go, "Well, okay, it's legal,
but it really," yeah. (audience laughs) So normal destruction and
magically winked are the same, and now we have multipool. Now, multipool is a
completely different beast. And it's backed by the global allocator. So what is a multipool? It's got a bunch of
different dynamic pools. It's like, I call it a beanstalk, but they're all these different, I'm gonna call them dynamic pools, but they're designed for different sizes.

So now … What happened there? That's not right. Let me try this again. This clicker is so sad. I'm sorry. So this is what happens. So during the course of, you saw with the end picture, but this is what's gonna happen. Over the course of running this thing, we're gonna keep allocating, and you'll notice the size, the chunk size that we get
each time is gonna double. And you see how we're
using up the chunk size 'cause they're two iterators, and it's just going until it hits the end. And then when it hits the end, we just create another
thing that's twice as big and so on like that. Now, here we have it saturating, but in real life it wouldn't saturate because it turns out
with the two iterators there's no reason not to
keep doubling forever.

Well, when you just populated a free list and went in and loaded the free list, there needed to be a saturation point. And I wanna thank Pablo
Halpern for pointing out to me at C++Now about four years ago that what I was doing here is brain dead, and thank you very much, Pablo. So it's now even faster. It's not so much faster that
you could really notice it, but it's aesthetically
faster, and it's better. (audience laughs) So anyway, if you exceed the maximum size, then all of this stuff is gonna come straight from the allocator,
and that's very important, because if it comes straight
from the backing allocator, there's no buffering, there's no pooling.

It's as if there's no allocator at all. So if you have a pooling allocator and you allocate a megabyte, you don't have a pooling allocator. You're just calling straight through. And that's gonna turn out to be important. All right. So, we've talked about type
parameter and abstract base, and they're really the same. And then we have the combination, and the combination is simply a multipool backed by a sequential or a
monotonic backed by the global, and it has, of course, you
can do it by type parameter. Wait. Come on. Wait a second. Sorry. My apologies. I got a new clicker
'cause I lost my old one, and I don't know to use this one properly. Let me just come back here.

So we have this one. So now, we have type parameter, and then we have magically winked. Did I get that right? Yeah. All right. So anyway, it's the same kind of thing, rather than go through the whole thing. It's just the same kind of pattern. You've gotten all the information. It's just I could repeat the same steps, so I'm not gonna do that. What did I miss here? So we're gonna have basic parameters to use to characterize our experiments. And the two we're gonna use are N for the number of instructions.

So a program is gonna be assigned a size by the number of instructions it executes. And again, this is very vague. And it's also gonna be based on the number of threads, worker threads. Now, most of the
experiments we're gonna do don't involve multiple threads, so that's just there sort of at the end. So the next question is, what aspects of software affect
optimal allocation strategy? And this is kind of like Family Feud. So I've got these five things, and the first letter is on the front, and it'll turn out that these letters are gonna form a mnemonic. So can anybody think what
those letters might mean? I'm not gonna take a lot of time, 'cause you'll never guess. Can you think of some? Just try it. What might matter? Can anybody think of a
dimension that might matter? 'Cause this is something
we had to think about. We wanted to think about, what dimensions of a problem would we come up with so that we could say it's got a lot of that
or a little of that, and then based on that, we could make some sort of recommendation? – [Man in Audience] Fragmentation.

– Fragmentation, excellent. – [Man in Audience] It's not up there. – It's not up there,
but we'll get to that. – [Man in Audience] Duration. – What? – [Man in Audience] Duration, how much time you're holding memory. – Okay, duration. It's gonna turn out that that's U. (audience laughs) All right. So here we have density
of allocation operations. You can imagine a program that does a lot of allocating
and deallocating, right? You could imagine a program
that doesn't do that at all. That's a dimension. Another one is variation
in allocated sizes. You can imagine a program that allocates one size and that's it, or it allocates, every time
it allocates a different size.

Right? You could imagine locality
of accessing memory as being a dimension. Is that important? Is the usage pattern of your program such that you could
exploit locality or no? How about utilization of allocated memory? That means, do I at one shot get all the memory I need,
use it, and throw it away, or do I get a little, give it back, get a little, give it back,
get a little, give it back? Right? That's more along the lines of the way in which you use the memory. Is it all at once, or is it piecemeal? And then finally, just contention
of concurrent allocations. So these are five dimensions. You might say, "Those are really stupid." But if you don't know anything and you have to come up
with five dimensions, at least I got five dimensions.

That took weeks. There you go. These are intended to be rough
indications, not precise. I wanna put them on a scale of zero to one where zero is not much and, this thing did it to me again. So they fit into that thing. Zero is minimal. One is maximal. They're far from independent. So just because I have five doesn't mean that they're
independent variables. They are not independent variables. Now, it would've been nice if I could've come up with an experiment that I could vary each one independently in just one experiment.

That would've been nice,
but that didn't happen. So anyway, density is the number of allocation, deallocation ops over the number of operations. And so zero means we
don't use an allocator, and one means every instruction is allocating or deallocating. (audience laughs) So, if you think about populating
a vector with push back, so what happens there? Think about what happens.

If I just say push back, push
back, push back on a vector to a billion, does that
have a high allocation or a low allocation? And I'm specifically saying vector of int. Would you say that has a
high allocation density or a low allocation density, where I push back a billion integers? What? – [Man in Audience] It's relatively low. – Relatively low. What do you guys think, low? – [Man in Audience] A little bit low. – Yes, sir. – [Man in Audience] Low. – We're all low. Right, because if I push
a billion things on, there are gonna be 30
allocations, let's say, and then there are a billion other things, so 30 over a billion, small. Just giving you an idea.

Then variation in allocated sizes. So everything is allocated the same. And then everything is different. Make sense? Okay. So consider like STD set
of int or STD string. Just think about those guys. Do they have variation
or are they all the same? If you have a set of int,
every node is the same. If you have strings, it could vary all over the place, right? The footprint of the string is the same, but if it's a long enough string, then it could vary to whatever it is.

Now locality is the most complex one. Locality, you have to think about the number of instructions in
a sub-region over a duration, and then you have the memory
size of the sub-region, and you have the number of transitions out of that during this period. And so if you look at those, they form this little,
first order equation. I don't mean this to be really linear, but to a first order or a
zeroth order, this is the idea. And then you can break
into physical locality and forget about the T. You can break it into temporal locality and forget about the M. And again, I'm just trying
to give you the idea that the more instructions,
the more locality, and the smaller the
memory, the more locality, and the fewer times you
jump out of that region, the more locality.

I'm just giving you the rough
way the thing fits together. So, zero is, we've got a large
and not accessed repeatedly. That's what most programs are, large and not accessed repeatedly. I mean, large or not
accessed repeatedly, right? It's large. The region is the whole memory. Or it's small, but it
goes all over the place. The other one is saying
that the sub-region is small and accessed repeatedly. If I've got a small region, not all of memory but it
has a little bit of memory, and I'm pounding on it,
that's high locality. That's good.

Okay. Even if we don't allocate
a lot in our program, even if we just do a
little bit of allocation and then spend the next
seven days accessing memory, the performance improvement
of local allocators can be amazing. I just want you to understand that it's not about the
speed of allocation. That's another myth, that, "Oh, I need to
allocate really quickly." No, you don't. You need to allocate well.

Okay. Utilization, the max memory in use divided by the total memory allocated. If I allocate a byte and free it for the entire duration of the program, my utilization is very low. If I allocate everything up front, use it, and blow it away, it's very high. Does that make sense? We're good? Finally, again, consider vector of int and assume that we allocate
the capacity up front. If I allocate the capacity
of a vector up front, I got one piece. And whatever I do to it, nothing
else is getting allocated. It's done, and then I free it. So that's everything up front. Whereas a vector of large strings, if I were to push and pop and
push and pop and push and pop, I wouldn't be getting
all the memory up front. I'd be getting some memory. Then I push and I pop and
then I get some more memory, so the amount of memory that
I've allocated is much larger than the amount of memory I
have at any given point in time.

And finally, there's contention. And contention really doesn't fit this, but you like to try to
fit things into something, so this is the fifth category. If I have no contention
at all, then it's zero. And if every thread is constantly trying to access the thing, or I should say is trying to
allocate or deallocate memory, then I have maximal contention. All right. So we have this summary here. And if you notice, the
mnemonic is going to be DVLUCK the duck. So I hope you remember it. Density, variation, locality,
utilization, and contention. DVLUC. I'll make it even easier later. Okay, so remember that. So we're now at the next step,
analyzing the benchmark data.

And of course, this is what
we were all waiting for, but we're almost out of time. I'm going to start, because I wanna make sure
I have time for questions, but does anybody have
a question right now? As long as it's not heckling. (audience laughs) I'm waiting. I'm saving the heckling for
the end, 'cause I'm scared. Yes. – [Man in Audience] With
monotonic allocators, how do you decide the size? – With allocators, how do I decide? – [Man in Audience] The size
of the monotonic allocator? – The question is, with
the monotonic allocator, how do I decide how big to make it? – [Man in Audience] On the stack. – On the stack. Okay. The way you decide is you figure out how big it needs to be,
and you make it that big. (audience laughs) Yes. – [Man in Audience] If you use pragma pack would that also affect your allocators if you're using the previous allocator, the (mumbles) allocator where you're– – Okay, the question is
something about pragma– – [Man in Audience] Pack. – Pack. I don't think I even know what that is.

What is that? – [Man in Audience] Like naturally, would you align them back to back? – So the question is, if I use this thing that
I've never heard of, will it make things– (man in audience mumbles) I will tell you now, I have
never heard of this thing, but if the question is, do I need a compiler
pragma to do something, I think what you might be asking is, I think I might I know, I think I'm gonna figure
it out with the question. So let me say what I
think the question is. If I turn off alignment in the compiler so that everything just mushes together and I don't have to worry
about any holes, is that right? – [Man in Audience] Yes. – Ah! Why didn't you say so? (audience laughs) The answer is your program will
run like a dog with one leg, because that's not how computers work.

That's how compilers work, but that's not how computers work. So we don't do that. Don't do that. Do not turn off alignment so
that you get this packing. Yes. – [Man in Audience] I'm sorry. Just to clarify, if you
are running x86 or x64, its performance impact
is very, very small. – Okay. We just heard from somebody who knows. On an x86, the performance impact is very small, but not zero? Okay. – [Man in Audience] On anything else, it is completely (mumbles) – Okay. So what we heard is it's
either not as good or horrific. (audience laughs) I'll take that. Anybody else? That's good to know, I guess. Yes. – [Man in Audience] So when
you're measuring the locality, how do you determine the
size of the memory region? 'Cause you said N was like
the size of the memory region. – Okay, so the question is,
how do you measure locality? And the answer is you don't. The idea is, this is a model for a
human being to think about, what are the trade offs and so on? The idea is you measure
various system sizes, and you say, "The system size is this big.

"What happens if this big and so on?" And you compare, and
it's like an accordion. You say, "I got one system that's big, "two systems that are like this, "four systems that are like this. "They're all the same size,
but they're segmented." And you look and see what
the surface looks like.

And guess what we're gonna do? So that's happening. I want you to know that this talk did not happen in a couple of days. This talk is serious. We have about three minutes left. Maybe, I don't know if I should start, or if I should just break now and we'll come back maybe
three minutes earlier. Let's start on time, 'cause we literally have two
minutes or something left. Does anybody have anything
else they wanna ask? Yes. – [Man in Audience] Is there anything about the monotonic allocator that says its buffer has to be on the stack? – Question is, does
the monotonic allocator say anything about where
the buffer comes from? The answer is no. You can specify whatever you want. Note to self, stack is hot. Yes. – [Man in Audience] Is there a such thing as a backup strategy for
a monotonic allocator? – Is there a backup strategy
for a monotonic allocator? Typically, it is the backup strategy, but if you wanted to put
something that measured how much memory was going through, like a monitoring allocator,
you can chain them.

Anything else? We are gonna start sharp in 15 minutes, 'cause I'm going to get
through the rest of this. The good stuff is now about to happen, some crazy good stuff. So go up, but don't be late..

You May Also Like