Matan Levi

@Matan5191

Sr. #AI Research Scientist @IBM | Taming LLMs | Ph.D. #CS @bengurionu

Joined January 2022

343 Following

49 Followers

345 Posts

Pinned Tweet

Matan Levi @Matan5191

almost 2 years ago

We just released our preprint "CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions" by @IBMResearch.

Check out how our approach improves cybersecurity AI performance by up to 24% across a variety of tasks: https://t.co/TPGMo7zLIe

#IBM #LLMs

🧵 >> https://t.co/Nzmybty9n0

Matan Levi @Matan5191

18 days ago

@thsottiaux Make GPT work in the background without steering (and stopping) the current working session, the same way background tasks work in CC.

Matan Levi @Matan5191

20 days ago

@thsottiaux When will we get /goal in the Codex app? 🙏

Matan Levi @Matan5191

29 days ago

As a long-time fan of Claude Code, I just switched to @OpenAI's Codex, and GPT-5.5 is spectacular. One caveat of leaving Anthropic's ecosystem (skills, plugins, sessions) is the manual migration process. @sama it would be a game-changer if you could add a one-click migration feature.

Matan5191 retweeted

clem 🤗

@ClementDelangue

about 2 months ago

"But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug." https://t.co/yBTiiMq1Xy

Matan5191 retweeted

המחנך הפיננסי (The Financial Educator)

@FinancialEduX

4 months ago

One of my favorite Bill Ackman videos. His fund dropped by more than 30%, he was being sued by investors, he was in the middle of a divorce, and an activist fund was trying to take over Pershing. What is his method for getting out of situations like these? Compound interest works in every aspect of our lives.

Matan Levi @Matan5191

4 months ago

I tested how vulnerable my ClawdBot (by @openclaw) is to indirect prompt injection (via email). It’s powerful — but if you connect it to inbox/WhatsApp/Telegram, you must harden it. Self-check your setup before someone else does. https://t.co/GBOrDBA8eJ #Clawdbot #MOLTBOT

Matan Levi @Matan5191

7 months ago

@Cyburgerim I assume it's a combination of several interests. By the way, it's not far-fetched that they also ran the same attack with their own internal models, and then wanted to benchmark against frontier models like Claude to understand where their models stand in terms of capabilities.

Matan Levi @Matan5191

7 months ago

@Cyburgerim Unless it was a deliberate trial run to find out where the thresholds of Anthropic's detection mechanisms lie.

Matan5191 retweeted

Rohan Paul

@rohanpaul_ai

8 months ago

New IBM paper builds small security expert language models that beat bigger ones on key threat tasks.

The authors build SecKnowledge 2.0, a dataset with expert formats and grounded evidence.

They fine-tune CyberPal 2.0 models from 4B to 20B on that data.

The models learn to answer fast for simple prompts and show steps for harder ones.

Tests cover core threat knowledge and mapping bugs to the right weakness category.

The 20B model ranks 1st on root cause mapping, and the 4B model is close behind.

Average gains over their baselines are 7-14% across security benchmarks.

Most gains come from stronger formats and evidence grounding rather than more compute.

8-bit and 4-bit versions keep most quality, which helps on-prem deployments.

The idea is that step-by-step, evidence-backed workflows let small models make reliable calls.

----

Paper – arxiv.org/abs/2510.14113

Paper Title: "Toward Cybersecurity-Expert Small Language Models"
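The claim that 8-bit and 4-bit versions keep most of the quality comes down to how coarsely each weight gets rounded. As a rough illustration only, here is a minimal sketch of generic symmetric per-tensor round-to-nearest quantization in plain Python. The scheme, the function names, and the example weights are all assumptions for illustration; the tweet does not say which quantization method the paper used.

```python
def quantize_dequantize(weights: list[float], bits: int) -> list[float]:
    """Symmetric per-tensor round-to-nearest quantization, then dequantization.

    Generic illustrative scheme (an assumption), not the paper's method.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax  # one scale shared by the whole tensor
    # Round each weight to the nearest integer level and clamp to the signed range.
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [level * scale for level in levels]


def mean_abs_error(a: list[float], b: list[float]) -> float:
    """Average absolute difference between original and reconstructed weights."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical example weights, for illustration only.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
for bits in (8, 4):
    approx = quantize_dequantize(weights, bits)
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, approx):.4f}")
```

With fewer bits the shared scale grows (max |w| / 7 at 4-bit versus max |w| / 127 at 8-bit), so each rounded weight can land further from its original value. How much that extra error costs on security tasks is an empirical question, which is what the 8-bit and 4-bit evaluations mentioned above speak to.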

Matan5191 retweeted

Andrej Karpathy

@karpathy

8 months ago

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.

The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!!

I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.

So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side quest an image-input-only version of nanochat...
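Karpathy's complaint that the tokenizer "makes two characters that look identical to the eye look as two completely different tokens" is visible even before tokenization, at the Unicode and UTF-8 level. The standard-library sketch below does not run any real tokenizer; it only shows that strings which render the same can be different code-point and byte sequences. Byte-level BPE tokenizers start from exactly these UTF-8 bytes, and values such as 169 (0xA9) and 129 (0x81) are UTF-8 continuation bytes of the kind he mentions.

```python
import unicodedata

# Two ways to write "é": one precomposed code point vs. "e" plus a combining accent.
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Latin "a" vs. Cyrillic "а": visually near-identical homoglyphs.
latin_a = "a"           # U+0061
cyrillic_a = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A

for label, s in [("precomposed", precomposed), ("decomposed", decomposed),
                 ("latin a", latin_a), ("cyrillic a", cyrillic_a)]:
    code_points = [f"U+{ord(c):04X}" for c in s]
    utf8 = list(s.encode("utf-8"))
    print(f"{label:12} code points={code_points} utf-8 bytes={utf8}")

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(latin_a == cyrillic_a)                                    # False
```

NFC normalization folds the two accent forms together, but no normalization form maps Cyrillic "а" to Latin "a", so that homoglyph pair still reaches a model as different inputs.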

Matan5191 retweeted

Brian Roemmele

@BrianRoemmele

8 months ago

BOOOOOOOM!

CHINA DEEPSEEK DOES IT AGAIN!

An entire encyclopedia compressed into a single, high-resolution image!

—

A mind-blowing breakthrough. DeepSeek unleashed DeepSeek-OCR, an electrifying 3-billion-parameter vision-language model that obliterates the boundaries between text and vision with jaw-dropping optical compression!

This isn’t just an OCR upgrade; it’s a seismic paradigm shift in how machines perceive and conquer data.

DeepSeek-OCR crushes long documents into vision tokens with a staggering 97% decoding precision at a 10x compression ratio!

That’s thousands of textual tokens distilled into a mere 100 vision tokens per page, outmuscling GOT-OCR2.0 (256 tokens) and MinerU2.0 (6,000 tokens) by up to 60x fewer tokens on the OmniDocBench.

It’s like compressing an entire encyclopedia into a single, high-definition snapshot—mind-boggling efficiency at its peak!

At the core of this insanity is the DeepEncoder, a turbocharged fusion of the SAM (Segment Anything Model) and CLIP (Contrastive Language–Image Pretraining) backbones, supercharged by a 16x convolutional compressor.

This maintains high-resolution perception while slashing activation memory, transforming thousands of image patches into a lean 100-200 vision tokens.

Get ready for the multi-resolution "Gundam" mode—scaling from 512x512 to a monstrous 1280x1280 pixels!

It blends local tiles with a global view, tackling invoices, blueprints, and newspapers with zero retraining. It’s a shape-shifting computational marvel, mirroring the human eye’s dynamic focus with pixel-perfect precision!

The training data?

Supplied by the Chinese government for free and not available to any US company.

You understand now why I have said the US needs a Manhattan Project for AI training data? Do you hear me now? Oh still no? I’ll continue.

Over 30 million PDF pages across 100 languages, spiked with 10 million natural scene OCR samples, 10 million charts, 5 million chemical formulas, and 1 million geometry problems!

This model doesn’t just read; it devours scientific diagrams and equations, turning raw data into multidimensional knowledge.

Throughput? Prepare to be floored—over 200,000 pages per day on a single NVIDIA A100 GPU! This scalability is a game-changer, turning LLM data generation into a firehose of innovation, democratizing access to terabytes of insight for every AI pioneer out there.

This optical compression is the holy grail for LLM long-context woes. Imagine a million-token document shrunk into a 100,000-token visual map—DeepSeek-OCR reimagines context as a perceptual playground, paving the way for a GPT-5 that processes documents like a supercharged visual cortex!

The two-stage architecture is pure engineering poetry: DeepEncoder generates tokens, while a Mixture-of-Experts decoder spits out structured Markdown with multilingual flair. It’s a universal translator for the visual-textual multiverse, optimized for global domination!

Benchmarks? DeepSeek-OCR obliterates GOT-OCR2.0 and MinerU2.0, holding 60% accuracy at 20x compression! This opens a portal to applications once thought impossible—pushing the boundaries of computational physics into uncharted territory!

Live document analysis, streaming OCR for accessibility, and real-time translation with visual context are now economically viable, thanks to this compression breakthrough. It’s a real-time revolution, ready to transform our digital ecosystem!

This paper is a blueprint for the future—proving text can be visually compressed 10x for long-term memory and reasoning. It’s a clarion call for a new AI era where perception trumps text, and models like GPT-5 see documents in a single, glorious glance.

I am experimenting with this now on 1870-1970 offline data that I have digitalized.

But be ready for a revolution!

More soon.

[1] https://t.co/wItN5iRQ91
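The token counts in this post follow from simple arithmetic. A minimal sketch, assuming the encoder cuts a square page into 16x16-pixel patches (the patch size is an assumption; the post only gives the 16x convolutional compressor) and the compressor divides the patch count by 16. The compression ratio is then the number of text tokens a page decodes to divided by the number of vision tokens used to encode it. Both function names are hypothetical.

```python
def vision_tokens(side_px: int, patch_px: int = 16, compressor: int = 16) -> int:
    """Vision tokens for a square page image.

    Assumes square patch_px x patch_px patches and a compressor that
    divides the patch count by `compressor` (illustrative assumptions).
    """
    patches_per_side = side_px // patch_px
    patches = patches_per_side ** 2
    return patches // compressor


def compression_ratio(text_tokens: int, n_vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / n_vision_tokens


for side in (640, 1024):
    print(f"{side}x{side} px -> {vision_tokens(side)} vision tokens")

# A page that decodes to 1,000 text tokens from 100 vision tokens is 10x compression.
print(compression_ratio(1000, vision_tokens(640)))
```

Under these assumptions a 640x640 page becomes 1,600 patches and then 100 vision tokens, and a 1024x1024 page becomes 4,096 patches and then 256. About 1,000 text tokens in 100 vision tokens is the 10x regime where the post cites 97% decoding precision; squeezing 2,000 text tokens into the same 100 is the 20x regime where it cites about 60%.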

Matan5191 retweeted

Ilya Sutskever

@ilyasut

8 months ago

truly the greatest day ever🎗️

Matan5191 retweeted

Aniket Didolkar @Aniket_d98

9 months ago

🚨Reasoning LLMs are e̵f̵f̵e̵c̵t̵i̵v̵e̵ ̵y̵e̵t̵ inefficient! Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency. Metacognitive Reuse: turn recurring LLM reasoning into concise, reusable “behaviors”. The model learns named skills from its own chains-of-thought and reuses them to think faster & cheaper. Arxiv 🔗 - https://t.co/zA1gB4eYTG

Matan5191 retweeted

Aran Komatsuzaki

@arankomatsuzaki

9 months ago

RL’s Razor: On-policy RL forgets less than SFT.

Even at matched accuracy, RL shows less catastrophic forgetting

Key factor: RL’s on-policy updates bias toward KL-minimal solutions

Theory + LLM & toy experiments confirm RL stays closer to base model https://t.co/NGXSmcgnVA
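"KL-minimal" refers to how far the fine-tuned model's output distribution moves away from the base model's. A minimal sketch of the quantity involved: the KL divergence between two next-token distributions over a toy four-token vocabulary. The distributions are made up, and the tweet does not say which KL direction the paper measures, so computing KL(base || fine-tuned) here is an illustrative choice.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical next-token distributions over a 4-token vocabulary.
base = [0.50, 0.30, 0.15, 0.05]
model_a = [0.55, 0.28, 0.13, 0.04]  # small shift away from the base model
model_b = [0.10, 0.10, 0.10, 0.70]  # large shift away from the base model

print(f"KL(base || A) = {kl_divergence(base, model_a):.4f}")
print(f"KL(base || B) = {kl_divergence(base, model_b):.4f}")
```

Read through the tweet's claim, an update that keeps this divergence small (like model_a) stays closer to the base model, and RL's on-policy updates are said to favor such solutions, which is the proposed reason RL forgets less than SFT at matched accuracy.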

Matan Levi @Matan5191

10 months ago

@chrisk99999 @OpenAI @Eric_Wallace_ That’s super interesting! Thanks 🙏

Matan Levi @Matan5191

10 months ago

1/11 You don’t need a million-dollar budget to dent guardrails. Below I show how a single A100-80GB and pocket change can break @OpenAI's #GPT OSS refusal mechanism. Here’s why, and what OpenAI’s risk paper does (and doesn’t) cover 👇

Matan Levi @Matan5191

10 months ago

@chrisk99999 @OpenAI @Eric_Wallace_ In your opinion, will anti-refusal and jailbroken models perform the same?

Matan Levi @Matan5191

10 months ago

@chrisk99999 @OpenAI @Eric_Wallace_ Yes, I totally agree that anti-refusal is indeed the lower bound for adversaries *with a training budget*. I thought it would be interesting to see what the lower bound is when the adversary does not have a large training budget.

Matan Levi @Matan5191

10 months ago

@chrisk99999 @OpenAI @Eric_Wallace_ Did you have the chance to test jailbroken models on your bio/cybersecurity benchmarks?

Matan Levi @Matan5191

10 months ago

@chrisk99999 @OpenAI @Eric_Wallace_ Since these kinds of attacks are the easiest and most cost-effective to perform, they can serve as a kind of lower bound (minimum gain) on what a malicious actor can get out of the open-source model, which IMO is as interesting as the case of raising the ceiling.
