Ihor Stepanov

@ihor_step

I am the CEO and co-founder of Knowledgator. We are advancing the #information_extraction field with #opensource #AI models.

Ukraine

Joined July 2014

274 Following

188 Followers

374 Posts

ihor_step retweeted

Knowledgator

@knowledgator

12 days ago

Together with @AlexSmechov, we are releasing Opir: an efficient family of multi-task safety classification models for toxicity, jailbreaks, hate speech, and harmful content. https://t.co/Fo2apovAMG

Ihor Stepanov

@ihor_step

12 days ago

@AlexSmechov 100%, looking forward to more collaborations like that.

ihor_step retweeted

Aleks

@AlexSmechov

13 days ago

https://t.co/tMiL2TmaBL

ihor_step retweeted

NVIDIA AI

@NVIDIAAI

25 days ago

@xeophon @arcee_ai Open > closed

Ihor Stepanov

@ihor_step

25 days ago

@m_newhaus This is interesting, I need to add such capabilities for a new gliner serve inference engine.

Ihor Stepanov

@ihor_step

26 days ago

8 months ago, we released our GLiNER PII models.

On the Nemotron-PII test dataset, evaluated independently using the PII Masking Benchmark methodology, our model shows the strongest NER-based performance among the evaluated models.

Nemotron-PII is especially interesting because NVIDIA open-sourced it 1 month after our release. For models released later, or trained on benchmark-related training data, data contamination is harder to rule out.

The dataset is diverse and realistic: plain text, invoices, transcripts, tables, and more.

Over the past 8 months, our models have been battle-tested in real-world deployments, including hospital environments, and used by thousands of developers.

Models: https://t.co/jEpiDmFjhh

ihor_step retweeted

Knowledgator

@knowledgator

29 days ago

We are excited to share our new paper: “GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction.” https://t.co/rblN2vvnsH

Ihor Stepanov

@ihor_step

about 1 month ago

@m_newhaus @DataChaz Haha, maybe, and I actually helped a bit with it 😅. But it's fine for me, I am thinking more about getting GLiNER to the place it deserves.

Ihor Stepanov

@ihor_step

about 1 month ago

@m_newhaus @DataChaz True, especially given how much more attention the OpenAI privacy filter model received, despite being much less flexible. We need some more creative ways to popularize GLiNER.

ihor_step retweeted

𝚐𝔪𝟾𝚡𝚡𝟾

@gm8xx8

about 1 month ago

GLiClass Multilang extends GLiClass from English-first zero-shot classification into multilingual and cross-lingual classification without giving up the efficiency profile of the original design.

What changed:
- native training on 20 languages
- cross-lingual inputs and labels
- 140M, 288M, and 1.72B model tiers
- new CrossAttn scorer with per-label pooling, unpadding, and flash-attn
- hierarchical labels through dot notation or dictionaries
- few-shot examples, label descriptions, and task prompts
- support for topic, sentiment, intent, reranking, hallucination detection, rule-following, safety classification, and NLI

The numbers are strong.
- multilang-ultra reaches 0.7212 English avg F1 and 0.5599 multilingual avg F1 at 200.7 samples/sec.
- multilang-mini gets 0.6827 English avg F1 and 0.5378 multilingual avg F1 at 513.4 samples/sec.
- multilang-edge keeps the footprint small at ~140M params while still hitting 553.6 samples/sec.

The important scaling detail: NLI-style baselines like bge-m3 and mDeBERTa need one forward pass per label, so throughput falls almost linearly as label count increases. GLiClass encodes all labels in one pass, so it remains usable for large taxonomies, multilingual moderation, routing, safety filters, and guardrail classification.
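The scaling detail in the post above can be illustrated with a rough count of encoder forward passes. This is only a toy cost model, not GLiClass code or its API: `forward_passes` is a hypothetical helper, and it ignores the fact that a single pass gets longer as more labels are packed into the input.

```python
def forward_passes(num_texts: int, num_labels: int, single_pass: bool) -> int:
    """Count encoder forward passes needed to score every (text, label) pair.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    if single_pass:
        # All labels are encoded together with the text: one pass per text.
        return num_texts
    # NLI-style scoring: each (text, label) pair needs its own pass.
    return num_texts * num_labels


if __name__ == "__main__":
    for n_labels in (2, 16, 128):
        nli = forward_passes(1000, n_labels, single_pass=False)
        joint = forward_passes(1000, n_labels, single_pass=True)
        print(f"{n_labels} labels: NLI-style={nli} passes, single-pass={joint} passes")
```

Under this simplified model the NLI-style pass count grows in proportion to the number of labels, while the single-pass count depends only on the number of texts.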

ihor_step retweeted

Knowledgator

@knowledgator

about 1 month ago

🌍 Meet SoTA Multilingual Classification Models at 140k tokens/s

We’re excited to release a new line of GLiClass models focused on the combination that matters most in practice: strong multilingual performance, zero-shot flexibility, and high inference efficiency.

We optimized the model implementation, introduced a new scoring mechanism, and improved our synthetic data generation approaches. All of it allowed us to achieve results better than those of all cross-encoders and GLiNER-based models we tested so far, while being many times faster.

This release includes 3 models with 100M, 300M, and 1.7B parameters, enabling them to run from mobile devices to production jobs on GPU machines.

The models were explicitly fine-tuned on 20 languages and can generalize beyond them, thanks to encoders pre-trained on 100+ languages. The model has strong cross-lingual abilities, meaning that your labels and input text can be in completely different languages.

On modern GPU hardware, our base model reaches up to 140k tokens/sec throughput, and remains highly efficient across larger label sets thanks to our single-pass classification architecture.

In addition to topic classification, the models support safety classification, sentiment analysis, and intent classification.

🔗 Find all models @huggingface : https://t.co/Bn951252Go

Ihor Stepanov

@ihor_step

about 1 month ago

@ramin_m_h Do you have any suggestions on how to start sales for an AI startup working on edge models for information extraction?

ihor_step retweeted

Eric W. Tramel

@fujikanaeda

about 2 months ago

If you want a whole system built around anonymization, PII detection (even more categories than OAI’s), and privacy rewriting, check out NeMo Anonymizer and Nvidia GliNER PII! https://t.co/DtLgPhD9MR

Ihor Stepanov

@ihor_step

about 2 months ago

@Pranav2278 @cohere Thanks for sharing! Actually, there have been many new updates in the GLiNER world since then. It would be nice to organize a new lecture.

ihor_step retweeted

Pranav :-

@Pranav2278

about 2 months ago

Watched this by @ihor_step What an amazing lecture @cohere should do more of these!! https://t.co/FfLRdIxf2m

ihor_step retweeted

clem 🤗

@ClementDelangue

about 2 months ago

I’m hearing there’s renewed lobbying in DC and in state legislatures to ban or severely restrict open-source. Like a few years ago, we’ll need everyone to help show policymakers why open-source matters: for startups, for competition, for economic growth, and for jobs. If you build with open-source, now is the time to speak up!

Ihor Stepanov

@ihor_step

about 2 months ago

@communicating @mervenoyann Thanks. In general, we are super open-minded and use various technologies to build highly accurate and efficient information extraction systems.