Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
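If anyone wants to reproduce the effect without pulling the whole repo, the core of the measurement is roughly this (a minimal sketch of my own, not the repo's code, for x86 with clflush/rdtscp; converting TSC cycles to nanoseconds and the real histogram are left out):

    #include <x86intrin.h>  // _mm_clflush, _mm_mfence, __rdtscp
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        // One cache line we evict before every read, so each timed load goes to DRAM.
        alignas(64) static volatile uint64_t line[8] = {1};
        std::vector<uint64_t> samples(1 << 20);

        for (auto& s : samples) {
            _mm_clflush((const void*)&line[0]);  // force the next load to miss
            _mm_mfence();
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            (void)line[0];                       // the timed DRAM read
            uint64_t t1 = __rdtscp(&aux);
            s = t1 - t0;
        }

        // Most samples cluster around the DRAM baseline; the sparse band of
        // ~4-5x outliers is the refresh stall.
        std::sort(samples.begin(), samples.end());
        printf("p50=%llu p99=%llu p99.9=%llu max=%llu (TSC cycles)\n",
               (unsigned long long)samples[samples.size() / 2],
               (unsigned long long)samples[samples.size() * 99 / 100],
               (unsigned long long)samples[samples.size() * 999 / 1000],
               (unsigned long long)samples.back());
    }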
The hedging technique is a cool demo too, but I’m not sure it’s practical.
At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
HFT in particular is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It's just far better to work with what you can fit in cache, and to shrink what doesn't fit as much as possible.
Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.
> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
That's my main hang-up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.
This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.
This might be useful in a case where a small lookup table or similar is often pushed out of cache, such that lookups are usually cold, yet the lookup data is small enough not to cause issues with cache pollution, increased bandwidth, or memory consumption.
It could be massively improved with a special CPU instruction for racing DRAM reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use this in a real program.
There’s no point racing DRAM reads explicitly. Refreshes are infrequent and the penalty is like 5x on an already fast operation, 1% of the time.
What's better is to "race" against cache, which is 100x faster than DRAM. CPUs already do this for independent loads via out-of-order execution. While one load is stalled waiting for DRAM, another can hit the cache and do some compute in parallel. It's all already handled at the microarchitectural level.
There are already systems that do this in hardware. Any system that has memory mirroring RAS features can do this, notably IBM zEnterprise hardware, you know, the company that this video promoter claims to be one-upping.
The memory controller sends the read to the DIMM that is not refreshing. It is invisible to software, except for the side-effect of having better performance.
Mirroring is more of a reliability feature though, no? From my understanding it’s like RAID where you keep multiple copies plus parity so uncorrectable errors aren’t catastrophic. Makes sense for mainframes which need to survive hardware failures.
Refresh avoidance is a tangential thing the memory controller happens to be able to do in a scheme like that, but you’d really have to be looking at it in a vacuum to bill it as a benefit.
Like I said, it’s all about cache. You’re not going to DRAM if you actually care about performance fluctuations at the scale of refresh stalls.
Clearly, hitting a cache would be the better outcome. The technique suggested here could only apply to unavoidably cold reads, some kind of table that's massive and randomly accessed. Assume it exists, for whatever reason. To answer your question, refresh avoidance is an advertised benefit of hardware mirroring. Current IBM techno-advertising that you can Google yourself says this:
"IBM z17 implements an enhanced redundant array of independent memory (RAIM) design with the following features: ... Staggered memory refresh: Uses RAIM to mask memory refresh latency."
I can google, thanks. My point is that nobody is buying mainframes with redundant memory to avoid refresh stalls. It’s a mostly irrelevant freebie on hardware you bought for fault tolerance.
> Assume it exists, for whatever reason.
You need a data structure that’s too large for cache, randomly accessed enough that prefetching doesn’t help, so latency-sensitive that 250ns every 15us matters, but not so latency-sensitive that you’d invest in making it fit in cache instead. And on top of all that, you’re willing to double its memory footprint, making the cache situation worse for everything else on the system. I think we both know that’s too many contradictions to hold at once.
Isn't that rather trivial though as a source of tail latency? There's much worse spikes coming from other sources, e.g. power management states within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard RT workloads. This work doesn't change that in any way.
Yeah exactly, and it’s absolutely dwarfed by the tail latency of going to DRAM in the first place. A cache miss is a 100x tail event vs. an L1 hit. The refresh stall is a further 5x on top of that, which barely registers if you’re already eating the DRAM cost.
It is not only not practical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps. The winner is decided post hoc. This behavior is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner needs to signal somehow that it has won, and suppress the side effects of the losers, a multi-core coordination problem that wipes out most of the benefit of the tail improvement but, more importantly, also massively worsens the median latency.
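To make the objection concrete: an online version has to look something like the sketch below (my own code, not Tailslayer's; hedged_load and its two replica pointers are hypothetical). The thread spawn per read is obviously a strawman, but even with two parked, pinned workers, the signalling through the shared winner flag stays on the fast path of every read.

    #include <atomic>
    #include <cstdint>
    #include <thread>

    struct HedgedRead {
        std::atomic<int> winner{-1};   // -1 = undecided, else id of the racer that won
        uint64_t value[2];
    };

    // Each racer performs its load, publishes the result, then tries to claim
    // the win. The CAS (and the cache line holding `winner`) bounces between
    // cores on EVERY read, not just the rare refresh-stalled ones.
    void racer(HedgedRead& h, int id, const uint64_t* replica, size_t idx) {
        h.value[id] = replica[idx];                       // the actual DRAM read
        int expected = -1;
        h.winner.compare_exchange_strong(expected, id, std::memory_order_release);
    }

    uint64_t hedged_load(const uint64_t* r0, const uint64_t* r1, size_t idx) {
        HedgedRead h;
        // Spawning threads per read costs microseconds; a real system would keep
        // two pinned workers parked, but the winner signalling remains.
        std::thread a(racer, std::ref(h), 0, r0, idx);
        std::thread b(racer, std::ref(h), 1, r1, idx);
        int w;
        while ((w = h.winner.load(std::memory_order_acquire)) < 0) { /* spin */ }
        uint64_t v = h.value[w];
        a.join();
        b.join();
        return v;
    }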
You got downvoted for being an asshole, and if you continue to be an asshole on HN we are going to ban you. I suppose you don't believe this because we haven't done it yet even after countless warnings:
The reason we haven't banned you yet is because you obviously know a lot of things that are of interest to the community. That's good. But the damage you cause here by routinely poisoning the threads exceeds the goodness that you add by sharing information. This is not going to last, so if you want not to be banned on HN, please fix it.
A more accurate but less inspiring title would be:
RAM Has a Design Tradeoff from 1966. I made another one on top.
The first tradeoff, of 6x fewer transistors for some extra latency, is immensely beneficial. The second, of reducing some of that extra latency in exchange for extra copies of static data, is beneficial only to some extremely niche applications. Still a very educational video about modern memory architecture.
This is very much worth watching. It is a tour de force.
Laurie does an amazing job of reimagining Google's strange job-optimisation technique (for jobs hitting hard disk storage) that uses 2 CPUs to do the same job. The technique simply takes the result of whichever machine finishes first, discarding the slower job's results... It seems expensive in resources, but it works and allows high-priority tasks to run optimally.
Laurie re-imagines this process but for RAM!! In doing this she needs to deal with Cores, RAM channels and other relatively undocumented CPU memory management features.
She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed them.
This is a 54 minute video. I watched about 3 minutes and it seemed like some potentially interesting info wrapped in useless visuals. I thought about downloading and reading the transcript (that's faster than watching videos), but it seems to me that it's another video that would be much better as a blog post. Could someone summarize in a sentence or two? Yes we know about the refresh interval. What is the bypass?
"Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls.
"It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules, using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton. Once the request comes in, Tailslayer issues hedged reads across all replicas, allowing the work to be performed on whichever result responds first."
FYI if you have a video you can't be bothered watching but would like to know the details you have 2 options that I use (and others, of course):
1. Throw the video into notebooklm - it gives transcripts of all youtube videos (AFAIK) - go to sources on the left and press the arrow key.
Ask notebooklm to give you a summary, discuss anything, etc.
2. YouTube now has a little diamond icon with "Ask" next to it, between the Share and Save icons. This brings up Gemini and you can ask questions about the video (it has no internet access). This may be premium-only. I still prefer Claude for general queries over Gemini.
> using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton
Seems odd to me that all three architectures implement this yet all three leave it undocumented. Is it intended as some sort of debug functionality or what?
you could however link to the timestamp where that particular explanation starts. i am afraid i don't have time to watch a one hour video just to satisfy my curiosity.
The actual explanation starts a couple minutes later, around https://youtu.be/KKbgulTp3FE?t=1553. The short explanation is performance (essentially load balancing against multiple RAM banks for large sequential RAM accesses), combined with a security-via-obscurity layer of defense against rowhammer.
Not complaining about the particular presenter here: this is an interesting video with some decent content, I don't find the presentation style overly irritating, and it documents a lot of work that has obviously been done experimenting in order to get the end result (rather than just summarising someone else's work). Such a goofy, elongated style, which is infuriating if you are looking for quick hard information, is practically required in order to drive wider interest in the channel.
But the “ask the LLM” thing is a sign of how off kilter information passing has become in the current world. A lot of stuff is packaged deliberately inefficiently because that is the way to monetise it, or sometimes just to game the searching & recommendation systems so it gets out to potentially interested people at all, then we are encouraged to use a computationally expensive process to summarise that to distil the information back out.
MS's documentation for large chunks of Azure is that way, but with even less excuse (they aren't a content creator needing to drive interest by being a quirky presenter as well as a potential information source). Instead of telling me to ask copilot to guess what I need to know, why not write some good documentation that you can reference directly (or that I can search through)? Heck, use copilot to draft that documentation if you want to (but please have humans review the result for hallucinations, missed parts, and other inaccuracies, before publishing).
The video definitely wouldn't be over 50m if she was targeting views. 11m -15m is where you catch a lot of people repeating and bloviating 3m of content to hit that sweet spot of the algorithm. It's sad you can't appreciate when someone puts passion into a project.
This is the damage AI does to society. It robs talented people of appreciation. A phenomenal singer? Nah she just uses auto tune obviously. Great speech? Nah obviously LLM helped. Besides I don't have time to read it anyway. All I want is the summary.
Yes, I do want the summary because my time is (also) valuable. There is a reason why book covers have synopses, to figure out whether it's worth reading the book in the first place.
I don't consider AI to threaten "damage to society" the way you seem to, but I did find it interesting to think about how ridiculously well-produced the video was, and what that might signify in the future.
I kept squinting and scrutinizing it, looking for signs that it was rendered by a video model. Loss of coherence in long shots with continuity flaws between them, unrealistic renderings of obscure objects and hardware, inconsistent textures for skin and clothing, that sort of thing... nope, it was all real, just the result of a lot of hard work and attention to detail.
Trouble is, this degree of perfection is itself unrealistic and distracting in a Goodhart's Law sense. Musicians complain when a drum track is too-perfectly quantized, or when vocals and instruments always stay in tune to within a fraction of a hertz, and I do have to wonder if that's a hazard here. I guess that's where you're coming from? If you wanted to train an AI model to create this type of content, this is exactly what you would want to use as source material. And at that point, success means all that effort is duplicated (or rather simulated) effortlessly.
So will that discourage the next generation of LaurieWireds from even trying? Or are we going to see content creators deliberately back away from perfect production values, in order to appear more authentic?
I like the video because I can't read a blog post in the background while doing other stuff, and I like Gadget Hackwrench narrating semi-obscure CS topics lol
this is a thing people do. convince themselves they can consume technical content subconsciously. it's not how the brain works though. it will just give you the idea you are following something.
not all technical content is the same, or has the same level of importance. this video does not introduce anything that i need to be able to replicate in my work, so i don't need to catch every detail of it, just grasp the basic concepts and reasons for doing something.
Lots of people will have a show on or something while they're cooking or cleaning or doing other things. Is it worse for it to be interesting technical content with fun other stuff thrown in than if it was an episode of Friends or Frasier or Iron Chef or 9-1-1: Lone Star or The Price is Right?
I guess I'm only allowed to have The Masked Singer on while I make dinner.
FWIW, I like her videos but I usually prefer essays or blog posts in general as they're easier to scan and process at my own rate. It's not about this particular video, it's about videos in general.
I get a similar feeling when friends send me 2-minute+ Instagram reels; it's as if my brain can't engage with the content. I'd much rather read a few paragraphs about the topic, and it'd probably take less time too.
Same; thanks to modern technology, videos can be transcribed and translated into blog posts automatically though. I wish that was a default and / or easier to find though.
For years I've been thinking "I should watch the WWDC videos because there's a lot of Really Important Information" in there, but... they're videos. In general I find that I can't pay attention to spoken word (videos, presentations, meetings) that contain important information, probably because processing it costs a lot more energy than reading.
But then I tune out / fall asleep when trying to read long content too, lmao. Glad I never did university or do uni level work.
>> It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules
This is the sort of thing which was done before in a world where there was NUMA, but that is easy. Just taskset and mbind your way around it to keep your copies in both places (rough sketch below).
The crazy part of what she's done is how to determine that the two copies don't get hit by refresh cycles at the same time.
Particularly by experimenting on something proprietary like Graviton.
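The poor man's NUMA version of that looks roughly like this (a sketch assuming libnuma and at least two nodes; pinning the reader threads with taskset or pthread_setaffinity_np is the other half):

    #include <numa.h>     // libnuma; link with -lnuma
    #include <cstring>

    // Keep one copy of a read-mostly table on each NUMA node; each thread then
    // reads whichever copy is local to the node it is pinned on.
    bool replicate_per_node(const void* src, size_t len, void* copies[2]) {
        if (numa_available() < 0 || numa_max_node() < 1)
            return false;                          // no NUMA, or only one node
        copies[0] = numa_alloc_onnode(len, 0);
        copies[1] = numa_alloc_onnode(len, 1);
        if (!copies[0] || !copies[1])
            return false;
        std::memcpy(copies[0], src, len);
        std::memcpy(copies[1], src, len);
        return true;                               // free later with numa_free(ptr, len)
    }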
Access optimization or interleaving happens at a lower level than linearly mapping DIMMs and channels. The x86 cache line size is 64 bytes, so it must be a multiple of that. Probably 64*2^n bytes.
EPYC chips have multiple levels of NUMA - one across CCDs on the one chip, and another between chips in different motherboard sockets. As a user under Linux you can treat it as if it was simple SMP, but you’ll get quite a bit less performance.
Home PCs don’t do NUMA as much anymore because of the number of cores and threads you can get on one core complex. The technology certainly still exists and is still relevant.
I hope this approach gets some visibility in the CPU field. It could obviously be improved with a special CPU instruction which simply races two reads and returns whichever one succeeds first. She's doing an insane amount of work, spawning multiple threads and so on (and burning lots of performance), all to work around the lack of dedicated support for this in silicon.
The results are impressive, but for the vast, vast majority of applications the actual speedup achieved is basically meaningless since it only applies to a tiny fraction of memory accesses.
For the use case Laurie mentioned - i.e. high-frequency trading - then yes, absolutely, it's valuable (if you accept that a technology which doesn't actually achieve anything beyond transmuting energy into money is truly valuable).
For the rest of us, the last thing the world needs is a new way to waste memory, especially given its current availability!
This is a quite old technique. The idea, as I understood it, was that lots of data at Google was stored in triplicate for reliability purposes. Instead of fetching one, you fetched all three and then took the one that arrived first. Then you sent UDP packets cancelling the other two. For something like search where you're issuing hundreds of requests that have to resolve in a few hundred milliseconds, this substantially cut down on tail latency.
Aha that makes more sense, I thought it was specifically to do with job scheduling from the description. You can do something similar at home as a poor man's CDN by racing requests to regionally replicated S3 buckets. Also magic eyeballs (ipv4/v6 race done in browsers and I think also for Quic/HTTP selection) works pretty much the same way
https://en.wikipedia.org/wiki/Happy_Eyeballs is the usual name. It's not quite identical, since you often want to give your preferred transport a nominal headstart so it usually succeeds. But yes, there are some similarities -- you race during connection setup so that you don't have to wait for a connection timeout (on the order of seconds) if the preferred mechanism doesn't work for some reason.
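The headstart variant is easy to sketch with plain threads and a promise (my own sketch; `preferred` and `backup` stand in for the two connection attempts, and a real implementation would also cancel the loser rather than just abandoning it):

    #include <atomic>
    #include <chrono>
    #include <functional>
    #include <future>
    #include <memory>
    #include <string>
    #include <thread>

    // Fire the preferred attempt; only if it hasn't finished within the grace
    // period, fire the backup. Whichever completes first fulfils the promise.
    std::string race_with_headstart(std::function<std::string()> preferred,
                                    std::function<std::string()> backup,
                                    std::chrono::milliseconds headstart) {
        auto result = std::make_shared<std::promise<std::string>>();
        auto won    = std::make_shared<std::atomic<bool>>(false);
        auto runner = [result, won](std::function<std::string()> attempt) {
            std::string r = attempt();
            if (!won->exchange(true))              // first finisher wins
                result->set_value(std::move(r));
        };
        std::future<std::string> f = result->get_future();
        std::thread(runner, std::move(preferred)).detach();
        if (f.wait_for(headstart) != std::future_status::ready)
            std::thread(runner, std::move(backup)).detach();  // headstart expired: hedge
        return f.get();
    }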
Request hedging or backup requests are indeed the terms
I know for requests where you give the first request a bit of a headstart. I didn’t know about the term happy eyeball to signify that all requests fire at the same time.
Almost everything "new" was invented by IBM it seems like. And it goes by a completely different name there. It's still nice to rediscover what they knew.
everyone says this but no one says why it was clever. i find her videos have cool results but i can't have patience for them usually because it's recycled old stuff (can be cool but it's not ground-breaking).
there is a ton of info you can pull from: smbios, acpi, msrs, cpuid etc. etc. about cpu/ram topology and connectivity, latencies etc etc.
isn't the info on what controller/ram relationships exist somewhere in there, provided by firmware or platform?
i can hardly imagine it is not just plainly in there with the plethora of info in there...
there's srat/slit/hmat etc. in acpi, then there's MSRs with info (amd expose more than intel ofc, as always) and then there are registers on the memory controller itself as well as socket to socket interconnects from upi links..
it's just a lot of reading and finding bits here n there. LLMs are actually really good at pulling all sorts of stuff from various 6-10k page documents if u are too lazy to dig yourself -_-
The exact mapping between RAM addresses and memory controllers is intentionally abstracted by the memory subsystem with many abstraction layers between you and the physical RAM locations.
Because documentation is sometimes incomplete or proprietary, security researchers often have to write software that probes memory and times the accesses to reverse-engineer the exact interleaving functions of a specific CPU. In the video she says that ARM CPUs have the least public information about this and she had to rely entirely on statistical methods.
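The general shape of such a probe is below (a simplified sketch of my own; real tools work on physical addresses via hugepages or /proc/self/pagemap and do much more careful statistics):

    #include <x86intrin.h>  // _mm_clflush, _mm_mfence, __rdtscp
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Time alternating uncached accesses to `base` and `base + (1 << bit)`.
    // Address bits that feed the channel/bank hash show up as a latency step:
    // pairs that land in the same bank suffer row-buffer conflicts, pairs that
    // split across channels/banks don't.
    static uint64_t time_pair(volatile uint8_t* a, volatile uint8_t* b, int reps) {
        uint64_t total = 0;
        unsigned aux;
        for (int i = 0; i < reps; ++i) {
            _mm_clflush((const void*)a);
            _mm_clflush((const void*)b);
            _mm_mfence();
            uint64_t t0 = __rdtscp(&aux);
            (void)*a;
            (void)*b;
            uint64_t t1 = __rdtscp(&aux);
            total += t1 - t0;
        }
        return total / reps;
    }

    int main() {
        const size_t len = 1ull << 26;                        // 64 MiB scratch buffer
        uint8_t* buf = (uint8_t*)std::aligned_alloc(1ull << 21, len);
        std::memset(buf, 1, len);                             // fault all pages in
        for (int bit = 6; bit < 26; ++bit)                    // from the cache-line bit upward
            std::printf("bit %2d: %llu cycles\n", bit,
                        (unsigned long long)time_pair(buf, buf + (1ull << bit), 2000));
    }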
I suppose hardware folks could cook up some kind of RAID 5 style striped RAM layout and use a hedging strategy: write out parity, and drop a late parity read instead of using it for error correction. With a striping strategy you get the benefit for less than the 2x RAM of full mirroring (toy sketch below).
I guess this would have been a nice way to reduce HDD latency as a new RAID mode... oh well.
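The reconstruction step is just XOR parity (a toy sketch; the striping itself, and detecting which channel is slow, would have to live in the memory controller):

    #include <cstdint>

    // With data striped as d0, d1, d2 plus parity p = d0 ^ d1 ^ d2, a word on a
    // slow (e.g. mid-refresh) channel can be rebuilt from the other reads
    // instead of waiting for it; if the parity read is the late one, just drop it.
    uint64_t reconstruct(uint64_t other0, uint64_t other1, uint64_t parity) {
        return other0 ^ other1 ^ parity;
    }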
LaurieWired is so incredibly smart, and so incredibly nerdy :-D
Really enjoyed this video, and I'm pretty picky. I learned a lot, even though I already know (or thought I knew) quite a bit about this subject as it was a particular interest of mine in Comp Sci school. I highly recommend it. Skip forward through chunks of the train part where she is messing around; it does get more informative later, though, so don't skip all of it.
Halfway through this great video and I have two questions:
1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? i.e. If I want to halve my effective memory in order to completely eliminate the tail latency problem, without having to rewrite legacy software to implement this invention.
2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.
1. not that I can think of, due to the core split. It really has to be independent cores racing independent loads. anything clever you could do with kernel modules, page-table-land, or dynamically reacting via PMU counters would likely cost microseconds...far larger than the 10s-100s of nanoseconds you gain.
what I wished I had during this project is a hypothetical hedged_load ISA instruction. Issue two requests to two memory controllers and drop the loser. That would let the strategy work on a single thread! Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation. But, you’d have to convince Intel/AMD/someone else :)
2. It’s called a “smokeninja”. Fairly popular in product photography circles, it’s quite fun!
> Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation.
Yeah it would be neat to just flip a BIOS switch and put your memory into "hedge" mode. Maybe one day we'll have an open source hardware stack where tinkerers can directly fiddle with ideas like this. In the meantime, thanks for your extensive work proving out the concept and sharing it with the world!
If you're able to do it at the memory controller level, would it be as simple as making two controllers always operate in lock-step, so their refresh cycles are guaranteed to be offset 50% from one another?
Given that the controller can already defer refresh cycles, and the logic to determine when that happens sounds fairly complex, I suspect that might already be in CPU microcode.
...which raises the tantalizing possibility that this lockstep-mirrored behavior might also be doable in microcode.
I can not think of any reason they would not want to do it.
However, I do see at least 2 downsides to this method.
Number one, it is at least 2x the memory. That has for a decently long time been a large part of the cost of a computer, but I could see some people saying 'whatever, buy 8x'.
The second is data coherency. In a read-only environment this would work very nicely. In a write-heavy environment this would be 2x the writes, and you have to wait for all of them to complete, or somehow mark the copies as not ready for the next read. It would be OK if the read of that page came some time after the write, but it's another place where things could stall out.
Really liked her vid. She explained it very nicely. She exudes that sense of joy I used to have about this field.
> halve my effective memory in order to completely eliminate the tail latency problem,
Wouldn't you have a tail latency problem on the write side though if you just blindly apply it everywhere? As in, unless all the replicas are done writing, you can't proceed.
Indeed. And only for certain DRAM refresh strategies. I mean, it's at least conceivable that a memory management system responsible for the refresh notices that a given memory location is requested by the cache and then fills the cache during the refresh (which afaiu reads the memory) or -- simpler to implement perhaps -- delays the refresh by a μs allowing the cache-fill to race ahead.
This is a cool idea, very well explained so that everyone can understand such an esoteric concept.
However, I wonder whether the core idea itself is useful in practice. With modern memory there are two main aspects it makes worse. First is cost: it needs to double the memory used for the same compute, which is not good with memory costs already soaring. Then there's the other main issue of throughput; I haven't put enough thought into that yet, but it feels like it requires more orchestration and increases costs there too.
Voxel Space[1] could have used this, would that multicore had been prevalent at the time. I recall being fascinated that simply facing the camera north or south would knock off 2fps from an already slow frame rate.
Many of our maps' routes would be laid out in a predominantly east- or west-facing track to maximize how long we stayed within cache lines as we marched our rays up the screen.
So, we needed as much main memory bandwidth as we could get. I remember experimenting with cache line warming to try to keep the memory controllers saturated with work with measurable success. But it would have been difficult in Voxel Space to predict which lines to warm (and when), so nothing came of it.
Tailslayer would have given us an edge by just splitting up the scene with multiprocessing and with a lot more RAM usage and without any other code. Alas, hardware like that was like 15 years in the future. Le sigh.
Delta Force's programmer was really boss (Daniele Gaetano), a physics guy turned coder if I remember correctly. He rewrote Voxel Space to be true'ish 3D and not fakey 2.5D. He explained the innovative backtracking he had to do on the ray caster to make that work, implemented mipmaps in the voxels, very very clever guy. One of the friendlier guys I've known in videogames.
I did the first version of the matchmaking for the network play in Delta Force but didn't make it into the credits because I quit before it shipped. My psycho coworker built a custom web browser(!) that integrated directly with my from-scratch matchmaking server. At least they let me work in C for that project; most everything else I had to do for them was assembler because that was not a "sissy" programming language. That server code was by-far the coolest thing I wrote for many years afterward.
Unfortunately, my server code couldn't handle more than like 32 concurrents because the Windows NT 3.0 kernel would BSOD with more. My (extremely grumpy) manager and the Sega Saturn coder called me a few days after I had quit to ask how the code worked. I suspect I left data in the socket buffers too long (was trying to batch up my message broadcasting work at regular intervals) and the kernel panicked over that.
I recall learning later the TCP/IP stack was homegrown in NT by Microsoft at that time and they licensed a good one for later versions, so I can't be blamed, it wasn't me! :D
I was just the lowly build-master/installer/utility developer, but I got tapped for testing and debugging and performance because I was just a sponge for coding knowledge because I wanted to be a game developer so baaaad at the time. I didn't get to do any of the game coding, and my experiments were just fruits of conversations with benevolent sages.
The reason facing east-west (or was it north-south, now I'm unsure) made such a difference in framerate was the color and height maps were ray marched in straight lines up from the bottom of the screen to the horizon. This meant you were zipping through the color map in straight lines, wrapping around to the other side if the ray went far enough.
When those straight lines lined up with the color and height map (north-south), life was good (and when a ray marched up a sheer canyon wall, life was VERY good.) But, when those straight lines went perpendicular (east-west) to the color and height map, you were blowing through the L2 cache constantly and going to main memory very often. I imagine on modern hardware these cache misses wouldn't amount to much measurable time, but on a 386dx with 8megs of RAM, the impact was very clear.
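In code, the marching loop was roughly the following shape (heavily simplified from memory, and 1024 is a stand-in for whatever the real map dimensions were):

    #include <cstdint>

    // The color/height maps were stored row-major, so a ray whose direction runs
    // along a row walks consecutive bytes (one cache line covers many steps),
    // while a ray marching across rows strides a full map width per step and
    // misses cache almost every time.
    constexpr int kMapW = 1024, kMapH = 1024;   // stand-in dimensions

    inline uint8_t sample(const uint8_t* map, int x, int y) {
        return map[(y & (kMapH - 1)) * kMapW + (x & (kMapW - 1))];  // wrap at edges
    }

    uint32_t march_ray(const uint8_t* heightmap, float x, float y,
                       float dx, float dy, int steps) {
        uint32_t acc = 0;
        for (int i = 0; i < steps; ++i) {
            acc += sample(heightmap, (int)x, (int)y);
            x += dx;   // dy ~ 0: stays inside a row -> cache-friendly
            y += dy;   // dx ~ 0: jumps kMapW bytes per step -> cache-hostile
        }
        return acc;
    }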
Novalogic was the only programming job I ever had where I got my own office with a door. ;) When I was with them, they had a policy of one game developer per game which I never saw again. Maximum cowboy coder energy, good times.
It halves (or thirds or quarters or etc) available CPU cores, cache space, memory bandwidth, all the critical resources. So I expect that it's only applicable for small reads that you are reasonably certain won't be in cache and that it can only be used extremely sparingly, otherwise it will be nothing but a massive drain.
I haven't had time to see the whole thing yet, but I'm quite surprised this yielded good results. If this works I would have expected CPU implementations to do some optimization around this by default given the memory latency bottleneck of the last 1.5 decades. What am I missing here?
She probably is already stinking rich, or at least rich enough. Beyond a certain point, though, research and knowledge seem more interesting than riches, particularly if you consider yourself a researcher. Otherwise, perhaps, she'd be doing the same in business and be Ellona or something. Thank God she does not, but the contrary - she is an inspiration to so many people, young and adult. Kudos!
For an HFT firm, RAM cost is a non-issue. Even the tiniest improvement in latency can result in millions of dollars of extra profit. They can octuple their RAM usage and still make a killing.
I bet Citadel already has reached out to Laurie :)
This is not a problem which needed to be fixed, it's an improvement in efficiency - though a costly one. We are talking about a company which does make its own custom microchips, so you could very well be right, but it may also be that they weren't even aware this was possible.
Citadel executes trades in about 10 microseconds, so a 500 nanosecond reduction in execution time is a 5% improvement. For a company which executes trades worth hundreds of billions a day, this translates to real money.
Your sarcasm indicates that you have no clue as to how important such an improvement can be for some actors. Some do though; the repo has almost 100 forks and 2K stars after just two days.
I saw a few years ago one group buying spools of fiber just to 'slow down' the trades, as they were submitting them to different datacenters across the country. They wanted everything to show up at the exact same time so no one would front-run their trades at different datacenters. They are willing to spend millions on HW if it gives them an edge in the market. They would buy bespoke boards that could hold 16x the RAM if it gave them a 50ns edge.
I have become oversensitive to this, and my brain is probably generating a lot of false positives. I don't think it's necessarily the case here, but I've wondered if people who use LLMs a lot take over some of its idiosyncrasies and in a way start sounding like one a bit. A strange side effect is that I've come to appreciate text with grammatical errors, videos where people don't enunciate well etc because it's a sign that it's human created content.
I think it's more people being fascinated by this curious architectural detail.
I imagine it's fascinating to people who are not exposed to the intricate details of computer architecture, which I assume is the vast majority here. It's a glimpse into a very odd world (which is your day-to-day work in the HFT field, but they rarely talk about this, and much less in such big words).
TBH, I didn't watch the video because the title is too click-baity for me and it's too long. Instead, I looked at the benchmark results on the Github page and sure, it's fascinating how you can significantly(!) thin the latency distribution, just by using 10× more CPU cores/RAM/etc. Classic case of a bad trade-off.
And nobody talked about what we use RAM for, usually: Not to only store static data, but also to update it when the need arises. This scheme is completely impractical for those cases. Additionally, if you really need low latency, as others pointed out, you can go for other means of computation, such as FPGAs.
So I love this idea, I'm sure it's a fun topic to talk about at a hacker conference! But I'm really put off by the click-baity title of the video and the hype around it.
You're absolutely right to call this out. No humans, no emotion, no real comments - just LLM slop.
In all seriousness, agreed. The top comment at time of this writing seems like a poor summarizing LLM treating everything as the best thing since sliced bread. The end result is interesting, but neither this nor Google invented the technique of trying multiple things at once as the comment implies.
No, something is funny here. In the previous submission (https://news.ycombinator.com/item?id=47680023) the only (competently) criticizing comment (by jeffbee) was downvoted into oblivion/flagged.
Well he veered off of the technical and into the personal so I'm not surprised it's dead. But yeah something feels weird about this comment section as a whole but I can't quite put my finger on it.
I think rather than AI it reminds me of when (long before AI) a few colleagues would converge on an article to post supportive comments in what felt like an attempt to manipulate the narrative and even at concentrations that I find surprisingly low it would often skew my impression of the tone of the entire comment section in a strange way. I guess you could more generally describe the phenomenon as fan club comments.
There are a few glazing comments there too though.
> Well he veered off of the technical and into the personal so I'm not surprised it's dead.
I don't know what he posted, but it is easy to see how a small fan group around Laurie can form?
She is an attractive girl not afraid to be cute (which is done so seldom by women in tech that I found a reddit thread trying to triangulate whether she is trans; I am not posting that to raise the question, but she piques people's interest), plus the impressively high effort put into niche topics PLUS the impressively high production value to present all that.
it was flagged because it was unnecessarily rude. nothing "funny" going on (with that comment chain at least).
i would note that it also appears to be wrong, reading laurie's reply, though i am not an expert. rude + wrong is a bad combo.
the next comment by jeffbee is also quite rude, and ignores most of laurie's reply in favor of insulting her instead. i dont think it is a mystery why jeffbee's comments were flagged...
Yeah, wow, the comments weren't kidding. This'll probably be the best video I watch all month, at least, if not more. I would have said what she was trying to do was "impossible" (had I not seen the title and figured … well … she posted the video) and right about when I was thinking that she got me with:
> Hold on a second. That's a really bad excuse. And technology never got anywhere by saying I accept this and it is what it is.