An underrated but occasionally make-or-break skill in AI research (that didn’t really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10. Nowadays language models are so multi-task that the answer to whether something works is almost always “it depends on the dataset”.
A common example of this is the question, “on what datasets does chain of thought improve performance?” A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals. Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference—the obvious examples are math and coding, but include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.
As another made-up example, let’s say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn’t improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge—if it wasn’t, then this was the wrong dataset to try browsing on.
In other words you should hesitate to draw a conclusion like “X method doesn’t work” without ensuring that the dataset used for testing actually exercises that method. The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.
yep people don't realize the significant mlops effort required with custom fine-tuned models & thus accrue technical debt.
data preparation for fine-tuning is way harder than the fine-tuning itself, & you have to be committed to regularly repreparing and retraining on new data
Calling it now:
After coming back to SF, it's clear to me we are in an AI bubble. Too many AI founders are building picks and shovels just because VCs want them to. Very few founders building from personal problems and passion, most just building because that's what everyone else is doing. Very few actually solving real problems.
Indeed it was a great event..Thanks to Ayush Jain and mindbowser to hosting it and @vphalak and Devendra Laulkar for sharing insight on the GenAI and making a worthy Saturday morning 👏
Adding some more pics here 🙌
Below is an explanation of "Agentic" workflows in ~ 15 LOC. At its core, "Agentic" just means LLMs that can call functions.
Another ex of how unnecessary jargon confuses people
https://t.co/w0YYsxoyLT
@gokulr This applies to technical roles as well. The difference is that you can't hire good enough generalists out of college. However, if they get an effective mentor and learn fast, they can get good at several skills before specializing. How many founders can/want to?
@gwenshap Back in the day, you were forced to learn these concepts as well as low level OS. Databases were nowhere as self managed as today. Recovering a failed Oracle DB or performance tuning meant popping up the hood every single time.
The Delicate Dance of Large Language Models - Memorization vs. Generalization
Critics of LLMs claim that they are stochastic parrots that do not do much more than memorize the world's information. Others, however, point to examples of LLMs displaying logic and reasoning abilities and claim they are beginning to exhibit emergent properties and may evolve to be a superintelligence that threats our very existence
It turns out that LLMs are actually pretty good at memorization.
Data Imprinting: During the training phase, a language model essentially encodes specific data points into the weights of the neural network. This is akin to how you remember the capital of France is Paris.
N-gram Storage: Essentially, a trained model will have "memorized" a lot of sequences of words (N-grams), but this isn't memory in the way humans understand it. It's a complex mathematical representation that allows LLMs to predict the next word in a sentence based on the preceding words.
This means that LLMs are extremely good at retrieving pretty much any data that they have been trained on and presenting it in a coherent way. In other words, LLMs are capable of memorizing without overfitting
This is also why GPT-4 is exceptionally good at answering questions on information BEFORE its training cut-off date of September 2021 and doesn't do well on information after that cut-off date.
In addition, LLMs don't have the capability of learning on the fly - Any information that you send it in real-time isn't memorized making them far less capable than the human brain
Generalization: The real magic happens when LLMs can take what they've learned and apply it to new situations.
Fundamentally, generalization happens because language models map linguistic constructs into a high-dimensional vector space, wherein each dimension could hypothetically represent a unique linguistic feature—be it syntactical, semantic, or otherwise. Predictive capabilities in this space enable generalization to previously unseen data.
Each sentence or word is represented in this mathematical space with similar sentences like "How is the weather" or "Tell me more about the climate" being close together. This space is what enables LLMs to "understand" the language.
When a new sentence is introduced, LLMs can find its place in this semantic universe based on its features. This helps it generate responses that are contextually appropriate, even if the model has never seen that exact sentence during training.
The vector space also allows for contextual understanding - for example, the word "bank" has totally different meanings when used near a river vs. when used in the context of money
Transfer Learning and Domain Adaptability: LLMs can be fine-tuned on specialized datasets and can adapt to the new domain - this is a form of generalized learning and points to the predictive power of these models.
There is overwhelming evidence that LLMs like GPT-4 have remarkable generalization capabilities as well. The combination of memorization and generalization is what makes it so effective.
LLMs are in the Goldilocks zone where a model can both memorize and generalize effectively. Too much memorization can lead to overfitting, where the model is too tailored to the training data and sucks at handling new info. Newer LLMs will evolve to memorize and generalize more and will be capable of complex reasoning tasks much like humans do.
Tao Te Ching, Verse 17
"When the Master governs, the people
are hardly aware that he exists.
Next best is a leader who is loved.
Next, one who is feared.
The worst is one who is despised.
If you don't trust the people,
you make them untrustworthy.
The Master doesn't talk, he acts.
When his work is done,
the people say, "Amazing: we did it, all by ourselves!""
~Lao Tzu
CS 25 has a great roadmap of LLM papers! 🙏
Transformers United has great guest lectures spanning the foundations of Large Language Models
An underrated aspect of the course is the curated list of papers on every topic
Perfect for your weekend reads:
https://t.co/WNbLfzf3Pf
9 tips for misery in work & life:
1) See self as a victim
2) Complain constantly
3) Resent successful people
4) Compare habitually
5) Nitpick to feel superior
6) Assume bad intent
7) Refuse to listen to others
8) Aim to impress everyone
9) Learn only by making mistakes
@sathyarg Love this interpretation. And the point of self inquiry would be to understand ones true nature. And not get attached too much to the code you write, the ppts you create, your diagnosis,... :-) Tech nor time has anything to do with it.
We're on the lookout for a Data Engineer with 2+ years of experience, and a passion to troubleshoot issues across various data sources and platforms! Apply here -https://t.co/wGsfDV4pFj or share the resumes at [email protected]#eCommerce#Growth#Hiring#DataEngineer
A model is NOT just a LightGBM/XGBoost/NN/etc.
A model is the algorithm *plus*:
• The parameters
• The data preprocessing
• The engineered features
Good deployment pipelines make these inextricable and/or easily reproducible.