A part of a growing family of open-source LLMs, such as Google’s LaMDA & PaLM, Hugging Face’s BLOOM & XLM-RoBERTa, Nvidia’s NeMO, XLNet, Co:here, & GLM-130B.
LLMs are trained on massive amounts of text & other modalities to produce human-like responses 2 natural language queries
@JR_Knox1977@ollama Improved reasoning, OCR, and world knowledge capabilities.
OCR (Optical Character Recognition), technology is often used to convert scanned documents or images into editable text, for example, when you scan a paper document and convert it into a PDF or Word file.
🚀We are thrilled to release LLaVA-1.6, with improved reasoning, OCR, and world knowledge. It supports higher-res inputs, more tasks, and exceeds Gemini Pro on several benchmarks! 🤯 It maintains the data efficiency of LLaVA-1.5, and LLaVA-1.6-34B is trained ~1 day with 32 A100s.
LLaVA-1.6 comes with base LLMs of different sizes: Mistral-7B, Vicuna-7B/13B, Hermes-Yi-34B. Thank you @MistralAI@lmsysorg@01AI_Yi@NousResearch
We serve LLaVA-1.6 efficiently with SGLang @lmsysorg. Thank you @lm_zheng@ying11231@shiyi_c98 for the integration.
And big shoutouts to awesome collaborators from the LLaVA team: @imhaotian@ChunyuanLi Yuheng Li @BoLi68567011 @zhang_yuanhan@shengs1123@yong_jae_lee
Blog: https://t.co/Mjd609CY2r
Demo: https://t.co/7VxuVA0gPk
Code/Model: https://t.co/gASnLmaUGY
If you’re interested in learning more about LLaVA or other LLMs, you can check out their websites, papers, GitHub repositories, demos, data sets, models, tutorials, and more.
You can also join the discussion on social media platforms or forums.
Let’s make history with LLMs!
A part of a growing family of open-source LLMs, such as Google’s LaMDA & PaLM, Hugging Face’s BLOOM & XLM-RoBERTa, Nvidia’s NeMO, XLNet, Co:here, & GLM-130B.
LLMs are trained on massive amounts of text & other modalities to produce human-like responses 2 natural language queries
It’s important to involve the research community & the public in the development & evaluation of LLMs, to ensure they are beneficial for humanity and society
The first multimodal models that’s easy to get up and running on consumer-level hardware - a GPU with less than 8GB of VRAM.
Also one of the first to use language-only GPT-4 2 generate visual instruction tuning data, a novel technique to teach the model without human supervision
A visual encoder and Vicuna, an open source chatbot based on Meta’s model, to make sense of images and text and how they relate.
It can answer questions about images, follow instructions across vision and language, and achieve state-of-the-art accuracy on the ScienceQA benchmark