Retroquad's Patreon was launched:
https://t.co/mdGdYZhfXl
It will feature content that is way more in-depth than my Twitter and YouTube accounts. All future Retroquad releases will be done through it.
If the minimum goal isn't reached, Retroquad will have to be cancelled.
@Dynasty1031 @EnigmaticVayne It was one of the best videogame adaptations because most others were unbearably awful. But compared to the actually good movies of the time, it was definitely mid.
The first videogame adaptation that I remember being actually on par with the movies of the time was Tomb Raider.
@EnigmaticVayne @Dynasty1031 Mortal Kombat still deserves better. The first 90s movie was mid, and the second was horrible.
The previous movie in this new continuity started off great but they cheapened it with that "arcana" crap, which likely hurt this one's box office bc people didn't expect improvement.
For people, the pain of losing is twice as intense as the happiness of gaining the same amount.
That is why, when investing or making decisions, our main focus is "not losing, rather than winning."
@falco_girgis Silly unrelated question, would some sort of TAA be possible on the DC?
Also, would it be possible to achieve anisotropic filtering through ripmaps (anisotropically scaled mipmaps)? The only requirement is being able to set different X and Y scaling on mipmaps.
To go along with the sequel announcement: for a limited time, the first game is free to grab on Steam! 🎮
✨That's FREE as in F-R-E-E, no strings attached!✨
Claim the game while the offer is available, and keep it forever!
Find the Steam page here:
https://t.co/rBWInwBUVG
@DiscussingFilm I just want to see a kangaroo and a 12-year-old on roller skates fighting 20 drug lords at a time and eating fully cooked turkeys right out of garbage cans
YEEEEES!!! GAINZ!!! Was up all night hand-rolling assembly routines for the Sega Dreamcast, playing SH4 instruction Tetris, with the goal of maximizing the gainz for the custom memcpy() replacement in my SH4ZAM accelerated math library.
We discovered not too long ago that a piss-simple for-loop that does a byte-by-byte copy, written in plain C, can actually stomp on the Newlib memcpy() implementation backing the C standard library in our SH GCC toolchains, given -O3 is enabled with loop unrolling...
This is obviously unacceptable for a community of engineers looking to push a piece of retro hardware to its limits, so we set off to look for alternative implementations...
We managed to find an extremely efficient one, written by STMicroelectronics Ltd for the SH4, which we were absolutely thrilled with... only... the license? LGPLv2.1, which wasn't going to be fit to power an entire community of commercial and open-source homebrewers as our dedicated memcpy() replacement within our toolchains...
So I got sick of that shit and decided to embark on a quest to roll my own hand-optimized, generic memcpy() replacement, as part of my SH4ZAM library...
Fast forward through months and months of pain and misery, continuous benchmarking, constant bugfixes, and several iterations of rewrites, and I've FINALLY absolutely DESTROYED both of the generic memcpy() implementations by Newlib and STMicroelectronics!
What you're seeing here in the left and middle panes is the complete out-of-line ASM implementation for the most critical, highest-throughput, fastest copy path in the whole generic memcpy() implementation... The pathological "best-case scenario."
On the right, you can see the top-level dispatcher and entry-point for shz_memcpy(), which is basically written to assess the size of the buffer we're copying, along with source and destination buffer alignments, in order to determine the most efficient hand-written ASM path to forward the call onto, which will handle the bulk of the transfer.
If the given pointers are not ideally aligned, the algorithm does a slower copy on the remaining bytes at the beginning and end of the destination buffer (shz_memcpy1()), until the buffer becomes cache-line aligned... at which point it chooses the best specialization it can, based on the remaining number of bytes which need to be copied and source buffer alignment.
Switch back now to the left pane, and you're looking at the fastest fast path, which can be selected to handle this bulk copy... It requires a copy size that is a multiple of 128 bytes, a destination buffer alignment of 32 bytes or greater, and a source buffer alignment of 8 bytes or greater, in order to do its magic.
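To make the dispatch rules above concrete, here is a minimal portable-C sketch of the decision logic only. It is NOT the real shz_memcpy() dispatcher: the enum, its labels, and the helper names head_bytes() and pick_path() are hypothetical, and the only facts taken from the thread are the three fast-path requirements (size a multiple of 128 bytes, destination 32-byte aligned, source 8-byte aligned).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path labels for illustration; these names are NOT from SH4ZAM. */
typedef enum { PATH_BYTES, PATH_GENERIC, PATH_FAST_128 } copy_path;

/* Number of bytes to copy slowly before dst reaches the next
 * 32-byte (cache-line) boundary; 0 if dst is already aligned. */
static size_t head_bytes(const void *dst) {
    return (size_t)(-(uintptr_t)dst & 31u);
}

/* Pick a bulk path for the n bytes that remain. Only the addresses are
 * inspected; nothing is dereferenced. */
static copy_path pick_path(const void *dst, const void *src, size_t n) {
    if (n == 0 || ((uintptr_t)dst & 31u) != 0)
        return PATH_BYTES;      /* dst not cache-line aligned yet */
    if ((n % 128u) == 0 && ((uintptr_t)src & 7u) == 0)
        return PATH_FAST_128;   /* all three fast-path requirements hold */
    return PATH_GENERIC;        /* some other (unspecified) specialization */
}
```

A caller would first copy head_bytes(dst) bytes with the slow path, then call pick_path() on the advanced pointers and the remaining size. How the real library splits the middle and tail of the buffer is not shown in the thread, so that part is deliberately left out.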
Here is a list of every trick I leveraged in its implementation:
1) The main copy loop is unrolled so that the entire FPU register file on the SH4 CPU--2 banks of 16 single-precision, 4-byte float registers--gets filled from the source buffer and written to the destination buffer, for a copy size of 128 bytes (4 cache lines) per iteration.
2) I swap to double-precision load/store mode with the FSCHG instruction, so that I can load and store 8 bytes into 2 registers at a time, for the same cycle cost as a regular 4-byte load/store.
3) I align the stack up to 8 bytes upon entry, so that I can push/pop the values of the FP regs that I'm clobbering 8 bytes at a time as well. This is faster than the way GCC manages the stack on its own.
4) I devised a complex prefetching scheme for streaming the contents of the source buffer into the 4 cache lines that get read each iteration of the main loop before they're actually accessed, which is WAY harder than it sounds for the SH4...
Any two overlapping prefetches are a stall, a write during a prefetch (even if it's a cache hit) is a stall, a cache miss during a prefetch is a double-fisted stall, and you need about 11 cycles for a prefetch to complete... so basically, looking at the SH4 the wrong way while it's issuing a prefetch will result in a full CPU pipeline stall, negating all gainz!
The main problem is that there are not enough cycles of non-stalling work to simply prefetch the cache line right before the one we're issuing load instructions on... so I've had to devise a scheme where prefetching happens TWO cache lines ahead, so that each prefetch has plenty of time to complete before its line is actually used within the pipeline.
5) The destination buffer, despite being a write-only buffer, will also result in a big-ass pipeline stall if it's not resident within the cache... meaning everything will stall while the PREVIOUS VALUE we're about to overwrite gets loaded... which is something we ain't got time for...
So I am manually "preallocating" the destination cache lines, one cache line ahead, just before I do a write to them, so that they are already resident within the cache, and there will be no stall, by the time they are written to.
6) I'm carefully pairing instructions based on their "group" types, as compatible instructions which are using different areas of the chip are able to leverage the superscalar nature of the pipeline and be dual-issued, so that they execute in parallel.
If you look at the group starting on line 75, you'll notice that I'm strategically interleaving integer ALU work while I'm pushing pairs of FP registers onto the stack with the FPU, as they execute in parallel.
7) I'm aligning the code for the hot 128-byte copy loop body to a 32-byte boundary within the .text segment, which is the size of an instruction cache line, so that it fits into as few cache lines as possible, reducing the number of pipeline stalls on icache fetches while the icache warms up during the first iteration.
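For anyone who wants the gist of tricks 1, 2 and 4 without reading SH4 assembly, here is a rough portable-C sketch of the loop shape: 128 bytes per iteration, moved in 8-byte units, with a prefetch hint issued two cache lines ahead of the line being read. This is an assumption-laden illustration, NOT the actual shz_memcpy128(): the name copy128_sketch() is made up, GCC's __builtin_prefetch() stands in for the SH4 prefetch instruction, fixed-size memcpy() calls stand in for the paired FP register moves, and none of the stall-avoidance scheduling or destination preallocation is modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32u   /* SH4 cache-line size in bytes */
#define CHUNK      128u  /* bytes moved per outer iteration (4 cache lines) */

/* Illustrative only: copy n bytes, where n is a multiple of 128. */
static void copy128_sketch(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off < n; off += CHUNK) {
        for (size_t line = 0; line < CHUNK; line += CACHE_LINE) {
#if defined(__GNUC__)
            /* Hint the line TWO cache lines ahead of the one being read,
             * but only while that line is still inside the buffer. */
            if (off + line + 2u * CACHE_LINE < n)
                __builtin_prefetch(s + off + line + 2u * CACHE_LINE, 0, 0);
#endif
            /* Move one cache line as four 8-byte units. The fixed-size
             * memcpy() calls are the portable way to express an
             * unaligned-safe 8-byte load and store. */
            for (size_t i = 0; i < CACHE_LINE; i += 8u) {
                uint64_t tmp;
                memcpy(&tmp, s + off + line + i, sizeof tmp);
                memcpy(d + off + line + i, &tmp, sizeof tmp);
            }
        }
    }
}
```

On a modern desktop compiler this will not beat the library memcpy(); the point is only to show the two-lines-ahead prefetch distance and the 128-byte chunking in readable form.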
So after ALL OF THAT BS COMBINES, you can see the results of one of my performance benchmarks, which is copying a 12KB buffer, whose source and destination addresses straddle cache lines and are unaligned...
The results are quite drastic. When the instruction cache and data cache are hot, I achieve a 1.9605x performance speedup over our builtin memcpy()! When they're both cold, I achieve a whopping speedup of 3.5408x, due to the hell which I went through to manually manage the cache!
For the second run, I pitted STMicroelectronics' "fast_memcpy()" against Newlib's, which resulted in speedups of only 1.8867x and 1.8913x for the hot and cold cache scenarios, respectively... meaning I BEAT STMicro!!!! HEEEEEEEEELL YEAH, BABY!!!
Here's the source-code for the full shz_memcpy128() implementation, which you can check out, if you're feeling brave of heart: https://t.co/01gZZVlfb6
SH4ZAM already ships with the KallistiOS SDK for Sega Dreamcast as a first-party, built-in library within kos-ports... so go pull down the latest commit and git in on deez gainz!! 💪