Large, Open, Hebrew base model!
---
Today I release the largest, most powerful open source Hebrew base model ever trained.
(And it is also great in English!
Benchmarks coming soon!)
---
Hebrew-Gemma-11B
- Base Model: https://t.co/N2wqmV8go6
- Instruct Model: https://t.co/7rBrqR8DR8
---
Continued Pretraining of Gemma-7B
A few hours after the release of Gemma-7B, it was discovered in a private conversation on ML Israel's back channels that Gemma's tokenizer was trained on multilingual data that also included Hebrew.
Gemma-7B is considered to be the longest trained open-weights model at the moment [0] and has high scores across many different benchmarks.
In other words,
we had an opportunity to continue pretraining Gemma-7B and create a high-quality Hebrew model.
---
Extending the model
In recent months, several different discoveries have been made about the ways LLM architectures work and their flexibility after they are trained.
Today we know that parts of trained models can be moved to different trained models and that LLMs can sometimes be improved simply by duplicating layers.
Three particularly high-performing base-models took advantage of this: Solar [1], LLaMA-Pro [2] and Goliath-120B [3].
- Goliath: Discovered that we can merge layers from several different models and that we can even "merge a model with itself" (duplicating layers) to improve performance.
- Solar: Discovered that models that were "merged with themselves" and then underwent a short fine-tuning session show an extreme jump in performance. (The 11B Solar model outperforms even Falcon-180B.)
- LLaMA-PRO: Discovered that "only training the expanded parts" can be used to avoid catastrophic forgetting when training models on specific domain knowledge while maintaining the previous knowledge.
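To make the "merging a model with itself" idea concrete, here is a minimal sketch of Solar-style layer duplication on a plain list of layers. The function name `depth_upscale`, the `drop` parameter, and the 32/8 numbers are illustrative assumptions, not the configuration used for this model.

```python
from copy import deepcopy


def depth_upscale(layers, drop):
    """Stack two copies of a layer list, Solar-style.

    The first copy loses its last `drop` layers and the second copy loses
    its first `drop` layers, so the middle layers end up duplicated.
    (`depth_upscale` and `drop` are hypothetical names for illustration.)
    """
    if not 0 <= drop < len(layers):
        raise ValueError("drop must be in [0, len(layers))")
    first = list(layers[: len(layers) - drop])
    # Copy the second half so the duplicated layers can be trained independently.
    second = [deepcopy(layer) for layer in layers[drop:]]
    return first + second


# Toy example: strings stand in for transformer blocks.
base = [f"layer_{i}" for i in range(32)]
expanded = depth_upscale(base, drop=8)
print(len(expanded))  # 48
```

With real weights, each list entry would be a transformer block rather than a string, and the expanded model would still need further training, as the Solar bullet above describes.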
-
For this model, I combined all three approaches:
1. The model was expanded to 11B parameters.
2. It then underwent continued pre-training.
3. Specific parameters of the model were dedicated only to the 3B Hebrew tokens that were added to the training dataset.
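The third step (dedicating specific parameters to the new data) can be sketched in the spirit of LLaMA-Pro: copy some blocks, freeze the originals, and train only the copies. This is a toy illustration; `Block`, `expand_and_freeze`, and the `every` parameter are hypothetical names, and the post does not say which blocks were actually expanded or trained.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    trainable: bool


def expand_and_freeze(blocks, every):
    """Insert one copied block after every `every` original blocks.

    Original blocks are frozen; only the inserted copies stay trainable.
    """
    expanded = []
    for i, block in enumerate(blocks, start=1):
        expanded.append(Block(block.name, trainable=False))
        if i % every == 0:
            expanded.append(Block(block.name + "_copy", trainable=True))
    return expanded


original = [Block(f"block_{i}", trainable=True) for i in range(8)]
model = expand_and_freeze(original, every=4)
trainable = [b.name for b in model if b.trainable]
print(trainable)  # ['block_3_copy', 'block_7_copy']
```

In a PyTorch model, the `trainable` flag would correspond to setting `requires_grad` on each block's parameters. LLaMA-Pro additionally initialises the copied blocks so that they start out as identity mappings, which this sketch leaves out.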
-
Moreover,
The model's training scheme is based on lessons learnt during the training of the original HebrewGPT:
The training is specially designed to encourage the base-model to "translate" its internal representations from English to Hebrew, so that it performs well in both languages while greatly reducing the compute required.
This is done by first exposing the "Hebrew parameters" of the model to a special translation dataset that is specifically built from English<->Hebrew Translation pairs.
Then, the training reverts to regular pre-training with balanced batches that always contain a 50/50 ratio of English and Hebrew tokens.
[The English dataset is Cosmopedia by Huggingface]
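The two-phase schedule described above (translation pairs first, then balanced batches) could be sketched roughly as follows. All names here are hypothetical, and the sketch assumes every sequence is packed to the same length, so that an equal number of sequences per language gives an equal number of tokens.

```python
import itertools


def balanced_batches(english, hebrew, batch_size):
    """Yield batches made of half English and half Hebrew sequences."""
    if batch_size % 2:
        raise ValueError("batch_size must be even")
    half = batch_size // 2
    en_iter, he_iter = iter(english), iter(hebrew)
    while True:
        en = list(itertools.islice(en_iter, half))
        he = list(itertools.islice(he_iter, half))
        if len(en) < half or len(he) < half:
            return  # stop once either language runs out
        yield en + he


def training_schedule(translation_batches, english, hebrew, batch_size):
    """Phase 1: translation batches. Phase 2: balanced 50/50 batches."""
    for batch in translation_batches:
        yield "translation", batch
    for batch in balanced_batches(english, hebrew, batch_size):
        yield "balanced", batch


# Toy data: strings stand in for fixed-length token sequences.
english = [f"en_{i}" for i in range(4)]
hebrew = [f"he_{i}" for i in range(4)]
translation = [["en->he pair 0", "he->en pair 1"]]

steps = list(training_schedule(translation, english, hebrew, batch_size=4))
phases = [phase for phase, _ in steps]
print(phases)  # ['translation', 'balanced', 'balanced']
```

A real pipeline would stream tokenized shards instead of in-memory lists, but the ordering and the per-batch language balance are the parts this sketch is meant to show.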
---
Want to help?
Continuing pre-training:
At this point, my Hebrew data collection and generation infrastructure has been running for 10 months straight in the background without any human intervention.
I have collected about 500B Hebrew tokens.
(And I'm not afraid to use them! 😏)
At the moment,
I have compute to continue pre-training the base model for around 1 month.
(During idle times of a GPU server owned by a tech-startup in Israel)
So during this period, the training will continue and improved versions of the model will be uploaded to Hugging Face.
If it were up to me, I wouldn't mind continuing the training forever, as long as compute power is available.
---
Apart from continuing the training, there are some training methods I am interested in incorporating into the model:
- Training the model also as an embedding model to support RAG in Hebrew, since no model is specifically trained for this at the moment. [4]
- Adding Arabic and Russian to the training data: together with Hebrew, these are the 3 most spoken languages in Israel. Note: Generalizing to additional languages is especially cheap with the above training method.
- Deepening and Expanding the model even more! If compute allows it, I would love to extend the model and add more parameters to it throughout the training.
- Extending the context window of the model: Following Ring Attention [5][6], it is now possible to train models with very long context lengths of hundreds of thousands or even millions of tokens.
If you want to see this happen on an open source Hebrew model and you have access to compute you are willing to donate to the project:
I would be more than happy if you reached out via DM.
---
Note: Limitations of this model.
This is not the original HebrewGPT.
HebrewGPT was trained for 100 times longer.
Therefore it knows a great deal of general "Israeli" knowledge and factual information.
You should not expect this current base model to know subtle "Israeli" facts.
For example: the laws of the State of Israel, orders from the Defense Ministry, recipes of the Israeli kitchen, deep religious knowledge, etc.
A great deal of data on all these topics (and much more) was collected during the creation of HebrewGPT and will be fed into this base model's training if compute allows it.
Until then:
The best way to use this model is with context or for fine-tuning.
---
Another Limitation: The Instruct Model.
Please note, the instruct model was trained quickly just to give a glimpse into the potential of the base-model after further fine-tuning.
I would not suggest using it as is for a downstream task.
I highly recommend fine-tuning the base-model for your own use-case instead.
---
Behind The Scenes
Although the training started 2 weeks ago, on the same day Gemma-7B was released, the actual training lasted only 5 days due to a series of technical issues.
These were a very long 2 weeks..
Gemma was a hard one to train..
But the results are worth it,
The model is noticeably better than the original!
---
Enjoy!
(And tell me what you think!
I still have compute for final tweaks and fixes)
---
Refs:
[0] The original Gemma-7B model was trained on 6-Trillion tokens.
[1] SOLAR: https://t.co/7HDqs22VOQ
[2] LLaMA-PRO: https://t.co/4wQURmzaeO
[3] Goliath-120B: https://t.co/axt6OsRVoQ
[4] GRIT is a model trained to be both a language model and a semantic search model. The combined training does not harm either of them and even improves both. It also saves users a lot of VRAM during RAG, since only one model needs to be loaded into memory: https://t.co/kfLrbK6lYA
[5] World Model was created with a window length of one million tokens here: https://t.co/JRbcfNcvSi