Landed multi-draw-indirect and bindless textures in Bevy 0.16! It cuts our CPU overhead for drawing in half on many scenes, and, along with the retained render-world scene graph I also landed, it brings us significantly closer to GPU-driven rendering.
Learned the other day: On GPU, inverse trigonometric functions are way more expensive than regular trigonometric functions. But on CPU, they're about the same speed (source: https://t.co/LYiDHquWIu)
One thing that would make Rust compiles faster in many cases is for Cargo to insta-kill any rust-analyzer processes when you run cargo build. That "Blocking" message wastes a lot of time.
Sorry, but at this point if you think Bevy's ECS is a burden for app logic I just don't believe you. You have to organize your code somehow. The question is whether the framework helps you do it or not.
Bevy 0.15 is out :) This time around, I wrote animation masks, additive blending, most of the Bevy Remote Protocol, chromatic aberration, point and spot light support for volumetric fog, and PCSS.
Bevy 0.15 is out now! It features Required Components, Entity Picking, Generalized Entity Animation, Animation Masks, Curves, Function reflection, Bevy Remote Protocol, VBAO, OIT, Chromatic Aberration, Fog volumes, Better Text, and more!
https://t.co/BJRmESL09s
Wanted: a `#[derive_fast_hash]` that gives you a fast hash and compare implementation that goes as many bytes as it can at a time, for types with no padding anywhere (the macro would guarantee this somehow).
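A rough hand-written sketch of what such a derive might expand to, assuming a `#[repr(C)]` struct made only of integer fields. The `Key` type and its `as_bytes` helper are hypothetical, and the "no padding" guarantee here is just a compile-time size check, not a real macro:

```rust
use std::hash::{Hash, Hasher};
use std::mem::size_of;

// Hypothetical key type: repr(C), integer fields only.
#[repr(C)]
#[derive(Clone, Copy)]
struct Key {
    a: u32,
    b: u32,
    c: u64,
}

// Compile-time check: if the field sizes add up to the struct size,
// there is no room left for padding bytes.
const _: () = assert!(size_of::<Key>() == 2 * size_of::<u32>() + size_of::<u64>());

impl Key {
    /// View the whole struct as one byte slice.
    fn as_bytes(&self) -> &[u8] {
        // SAFETY: Key is repr(C), holds only integers, and the assert above
        // rules out padding, so every byte of the struct is initialized.
        unsafe { std::slice::from_raw_parts(self as *const Key as *const u8, size_of::<Key>()) }
    }
}

impl PartialEq for Key {
    // One slice comparison over all bytes instead of field-by-field compares.
    fn eq(&self, other: &Self) -> bool {
        self.as_bytes() == other.as_bytes()
    }
}

impl Eq for Key {}

impl Hash for Key {
    // Feed all bytes to the hasher in a single write.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
    }
}
```

Whether this actually beats the derived implementations depends on the hasher and the type, so it would need measuring.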
The iOS ecosystem has gotten so close to having true JIT support that I'm not sure that the delta between all the workarounds and true JIT support is that meaningful anymore. You can already JIT in wasm or JS, or you can interpret, or you can get a JIT entitlement...
@Barteks2x @Lazin I agree (and this is one point of disagreement I have with the author of that code in JSC :)) But I certainly believe it'd be way faster than what native mallocs do for small allocations.
Honestly, I feel like programmers are way too quick to assume that "allocation" = "slow" is an iron law of the universe. In the JVM it's like 5 instructions.
The problem with C++/Rust allocators is that they're tuned to the workloads they observe in programs... (1/2)
This is why Zig and Rust are sane and C++ is crazy
Just assigning a variable to another can cause heap allocations
I imagine std is littered with all these hidden allocations
Imagine how hard it must be to write performant software
@DrawsMiguel As I recall you also need to hack it to stop checking for tracing/logging at runtime, and some other things. There's a lot of little needless overheads in there.
(When I offered to fix it back then I was told "optimizing small allocation perf isn't important for our workloads".)
A lot of C++/Rust malloc overhead comes from the loosely coupled malloc(size_t) interface. For example, the allocator has to compute which bin to use at runtime, when most of the time the compiler knows the size and could precompute the bin offset.
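To make the "precompute the bin" idea concrete, here is a minimal sketch assuming made-up power-of-two size classes starting at 16 bytes. `bin_index`, `BinFor`, and `MIN_SHIFT` are hypothetical names, not any real allocator's API:

```rust
/// Hypothetical size classes: 16, 32, 64, ... bytes (powers of two).
const MIN_SHIFT: u32 = 4; // log2 of the smallest class, 16 bytes

/// Map a request size to its bin index. Because this is a `const fn`,
/// it can be evaluated at compile time when the size is known.
const fn bin_index(size: usize) -> usize {
    let size = if size < 16 { 16 } else { size };
    let rounded = size.next_power_of_two();
    (rounded.trailing_zeros() - MIN_SHIFT) as usize
}

/// Per-type bin lookup: `BinFor::<T>::INDEX` is a compile-time constant,
/// so a typed allocation path would not need to compute the bin at runtime.
struct BinFor<T>(std::marker::PhantomData<T>);

impl<T> BinFor<T> {
    const INDEX: usize = bin_index(std::mem::size_of::<T>());
}
```

With an interface like `malloc(size_t)`, the same computation has to run on every call unless the call gets inlined and constant-folded.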
I don't think 5 insns is feasible, but you might be able to get to around 10. Load TLAB, load bin, check to see if bump mode/pop mode/slow path, bump or pop as necessary.
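A toy model of that fast path, assuming one bin inside a thread-local buffer. The `Bin` type is hypothetical, and it hands out offsets and keeps the free list in a `Vec` to stay in safe Rust; a real allocator would use raw pointers and an intrusive free list:

```rust
/// One size-class bin in a hypothetical thread-local allocation buffer.
struct Bin {
    free_list: Vec<usize>, // offsets of blocks that were freed (pop mode)
    bump: usize,           // next never-used offset (bump mode)
    end: usize,            // end of this bin's region
    block_size: usize,     // size of every block in this bin
}

impl Bin {
    /// Fast path: pop a freed block if there is one, otherwise bump.
    /// `None` means the caller has to take the slow path (refill the bin).
    fn alloc(&mut self) -> Option<usize> {
        if let Some(offset) = self.free_list.pop() {
            return Some(offset); // pop mode
        }
        if self.bump + self.block_size <= self.end {
            let offset = self.bump;
            self.bump += self.block_size; // bump mode
            return Some(offset);
        }
        None // slow path
    }

    /// Return a block to this bin so a later `alloc` can reuse it.
    fn free(&mut self, offset: usize) {
        self.free_list.push(offset);
    }
}
```

This only shows the shape of the control flow; it says nothing about the actual instruction count.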
Also consider malloc logging/tracing features. Very convenient! But it adds a runtime check on every allocation. When your point of comparison is ~5 instructions as in the JVM, those tiny branches add up.
@DrawsMiguel No, it's true with jemalloc too. When I measured jemalloc (which was a few years ago) it was still something like 80 instructions in the fast path. This is an order of magnitude difference compared with the JVM.
@Lazin Last I profiled jemalloc it was still like 80+ instructions even in the fast path. That's an order of magnitude difference. You really want to have the compiler start inlining the fast paths, like the JVM does.