@wessorh With this new repository I have the intention of creating a canonical benchmarking suite for YARA-X and YARA: https://t.co/4A9wWQgEDu
I'll be adding more rules and files in the future. The idea is to cover all the cases in which YARA and YARA-X tend to differ in performance.
@wessorh From the number of files, and their total size, I would say that the time is dominated by rule compilation/loading, and that would explain the bad results of YARA-X. For benchmarking scanning speed I would recommend bigger and more files, so that the scanning phase takes longer.
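To make the compile-versus-scan point concrete, here is a minimal sketch of a harness that times the two phases separately. The `time_phases` helper and its parameters are hypothetical names for illustration, and the demo uses Python's `re` module as a stand-in engine, not YARA or YARA-X; you would pass in whatever compile and scan calls your bindings provide.

```python
import re
import time


def time_phases(compile_fn, scan_fn, rules_source, files, repeat=1):
    """Time rule compilation and scanning as two separate phases.

    compile_fn(rules_source) -> compiled rules (engine-specific object)
    scan_fn(compiled, data)  -> truthy if the data matched
    Both callables are supplied by the caller; nothing here is YARA-specific.
    """
    start = time.perf_counter()
    compiled = compile_fn(rules_source)
    compile_secs = time.perf_counter() - start

    start = time.perf_counter()
    matches = 0
    for _ in range(repeat):
        for data in files:
            if scan_fn(compiled, data):
                matches += 1
    scan_secs = time.perf_counter() - start

    return {
        "compile_secs": compile_secs,
        "scan_secs": scan_secs,
        "matches": matches,
    }


# Stand-in engine (assumption for the demo): a single regex instead of real rules.
result = time_phases(
    compile_fn=lambda src: re.compile(src),
    scan_fn=lambda rx, data: rx.search(data) is not None,
    rules_source=rb"evil[0-9]+",
    files=[b"clean data" * 1000, b"xx evil42 xx" * 1000],
)
print(result)
```

If `compile_secs` is much larger than `scan_secs`, the benchmark is mostly measuring rule loading, and adding more or bigger files is what shifts the balance toward scanning.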
@cyb3rops @wessorh I was trying to recall the details about why I discarded the vectorscan/hyperscan option long ago, but I didn't write down my thoughts back then. I considered using hyperscan for YARA, but I found subtle differences that didn't make it a good fit for YARA's use case.
@wessorh I would also like to figure out why YARA-X seems to be consistently slower than YARA C in your tests. My own tests point in the opposite direction after the improvements in YARA-X 1.16.0. But performance is very sensitive to the rules and files scanned.
@wessorh Contributions are always welcome. For instance, a canonical set of rules and files that can be used for benchmarking would be really great, so that different implementations can be compared using that same set.
@wessorh @cyb3rops holloman2 is the interesting part because the claims are bold. If I understood correctly, you can entirely avoid the Aho-Corasick scan by comparing the file's fingerprint to the pattern's fingerprint. That would discard files that don't contain the pattern. Is that correct?
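To illustrate the general idea of a fingerprint prefilter, here is a minimal sketch. I don't know how holloman2 actually builds its fingerprints, so this is not its method: it is a generic byte-bigram bitmap, and `fingerprint` and `may_contain` are hypothetical names. A file is discarded only when some bigram of the pattern is missing from the file; files that pass would still need a real scan.

```python
def fingerprint(data: bytes) -> int:
    """Bitmap with one bit per distinct byte bigram (65536 possible bits)."""
    fp = 0
    for i in range(len(data) - 1):
        # Exact index for the bigram, so there are no hash collisions.
        fp |= 1 << ((data[i] << 8) | data[i + 1])
    return fp


def may_contain(file_fp: int, pattern_fp: int) -> bool:
    """True if every bigram of the pattern also occurs somewhere in the file.

    False means the file cannot contain the pattern.
    True is only a "maybe": the bigrams may occur without being contiguous.
    """
    return file_fp & pattern_fp == pattern_fp


pattern = b"malware_marker"
pattern_fp = fingerprint(pattern)

file_with = b"header " + pattern + b" trailer"
file_without = b"aaaaaaaaaaaa"

print(may_contain(fingerprint(file_with), pattern_fp))
print(may_contain(fingerprint(file_without), pattern_fp))
```

The trade-off in a scheme like this is that the filter never misses a file that contains the pattern, but it can let through files that don't, so how much scanning it saves depends on how selective the pattern's fingerprint is.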
@halvarflake I feel the same. For projects that I really care about my strategy consists in forcing myself to treat the AI like a pair-programming buddy. With that mindset I feel productive but in control, and actually end up learning a few things from the AI.
@cyb3rops @wessorh I guess there's a lot of variance too. Person A can create tons of bad rules within seconds, while person B takes the time to find a pattern that is likely to cover most specimens of a malware family while not producing false positives.
@wessorh That sounds very interesting. If you can point me to a repository I would take a look. Also, I think it would be great to have a canonical set of rules and files for benchmarking. I can come up with something.
YARA-X 1.16.0 is out!
This time with performance improvements that make it much faster. With this release YARA-X is faster than traditional YARA in almost every case.
https://t.co/dWYX5PmJy2