This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.
More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:
1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations
Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral https://t.co/z21CP5iQfu
There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.
TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.
1. I never said LLMs were not useful. They are, particularly with all the bells and whistles that are being added to them. I use them.
2. A robot-rich future can't be built with AIs that don't understand the physical world and don't anticipate the consequences of their actions. And LLMs really don't.
3. The future in the cartoon looks pretty dystopian TBH, but even a non-dystopian version will require world models and zero-shot planning abilities.
4. I rarely wear a suit and absolutely never wear a tie.
5. I would never ever place a coffee mug on top of a piece equipment.
6. I hope I'll look this young in 2032.
As AI agents accelerate coding, what is the future of software engineering? Some trends are clear, such as the Product Management Bottleneck, referring to the idea that we are more constrained by deciding what to build rather than the actual building. But many implications, like AI’s impact on the job market, how software teams will be organized, and more, are still being sorted out.
The theme of our AI Developer Conference on April 28-29 in San Francisco is The Future of Software Engineering. I look forward to speaking about this topic there, hearing from other speakers on this theme, and chatting with attendees about it. We’re shaping the future, and I hope you will join me there!
It is currently trendy in some technology and policy circles to forecast massive job losses due to AI. Even if they have not yet materialized, these losses certainly must be just over the horizon! I have a contrarian view that the AI jobpocalypse — the notion that AI will lead to massive unemployment, perhaps even rioting in the streets — won’t be nearly as bad as dire forecasts by pundits, especially pundits who are trying to paint a picture of how powerful their AI technology is.
Among professions, AI is accelerating software engineering most, given the rise of coding agents. According to a new report by Citadel Research, software engineering job postings are rising rapidly. So if software engineering is a harbinger of the impact AI will have on other professions, this expansion of software engineering jobs is encouraging.
Yes, fresh college graduates are having a hard time finding jobs. And yes, there have been layoffs that CEOs have attributed to AI, even if a large fraction of this was “AI washing,” where businesses choose to attribute layoffs to AI, even though AI has not changed their internal operations much yet. And yes, there is a subset of job roles, such as call center operator, that are more heavily impacted. Many people are feeling significant job insecurity, and I feel for everyone struggling with employment, whether or not the cause is AI-related. And many other factors, such as over-hiring during the pandemic and high interest rates, have contributed to the slowdown in the labor market, and the notion that AI is leading to unemployment is oversimplified.
In software engineering, I see a lot of exciting work ahead to adapt our workflows. It is already clear that: (i) As AI makes coding easier, a lot more people will be doing it. (ii) Writing code by hand and even reading (generated) code is not that important, because we can ask an LLM about the code and operate at a higher level than the raw syntax (although how high we can or should go is rapidly changing). (iii) There will be a lot more custom applications, because now it’s economical to write software for smaller and smaller audiences. (iv) Deciding what to build, more than the actual building, is becoming a bottleneck. (v) The cost of paying down technical debt is decreasing (since AI can refactor for you).
At the same time, there are also a lot of open questions for our profession, such as:
- In the future, what will be the key skills of a senior software engineer? And for junior levels, what should be the new Computer Science curriculum?
- If everyone can build features, what skills, strategies, or resources create competitive advantage for individuals and for businesses?
- What are the new building blocks (libraries, SDKs, etc.) of software? How do we organize coding agents to create software?
- What should a software team look like? For example, how many engineers, product managers, designers, and so on. What tooling do we need to manage their workflow?
- How do AI agents change the workflow of machine learning engineers and data scientists? For example, how can we use agents to accelerate exploring data, identifying hypotheses, and testing them?
I’m excited to explore these and other questions about the future of software engineering at AI Dev. I expect this to be an exciting event. Please join us!
[Original text: The Batch newsletter.]
https://t.co/i4bQevDG4i
LLM Knowledge Bases
Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:
Data ingest:
I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
IDE:
I use Obsidian as the IDE "frontend" where I can view the raw data, the the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).
Q&A:
Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
Output:
Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.
Linting:
I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.
Extra tools:
I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
Further explorations:
As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.
TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
Me and the team at @lovable just spent two months rewriting 42,000 lines of code from Python to Go.
Technical deep dive of why we did it +what this means:
// 1
I keep seeing a shared roadmap created by someone to get started with NLP and its fundamentals. It starts with Transformers and LSTMs...
This is insane...
I can't buy this. Randomly mentioning percentage of total counts doesn't mean how great is an AI agent, especially in support and customer facing use-cases. Every case needs to be examined, categorized, and complexity clarified, then the agent assessment can start from there.
So Marc Benioff just went onto the @loganbartlett show and claimed that Salesforce has a...*checks notes* 86% AI resolution rate for support inquiries!
Color me deeply skeptical
بما ان فيه كلام كثير اليومين دول عن ال Personal Finance، دي بعض المبادئ البسيطة اللي ممكن تساعد، مش كل الخطوات حتبقى لكل الناس دلوقتي، بس حطها في بالك لما ييجي وقتها.
و طبعا قبل كل حاجة دي أرزاق من عند الله، و احنا بس بنحاول نسعى لإدارة افضل لحياتنا و أننا نبعد عن الإسراف.
واحد من أنجح اللي اشتغلوا في Intel في عزها John Doerr عندة مقولة مشهورة:
"We need teams of missionaries, not teams of mercenaries."
"نحن نحتاج لفريق من المُبشرين وليس من المرتزقة"
هاشرح واحدة من أهم المميزات اللي بتعرف بيها الشخص الشاطر والمهم للفريق وللشركة ..
https://t.co/hrUP1ZOlHt
We recently had a CUDA cohort @CohereForAI. We'll be releasing the assignments and reads on github soon.
You can go through the cohort recordings in the Mini Cohort Section of the page above.
Do follow CUDA Mode on Youtube and discord. Give PMPP book a read too! Honestly PMPP is a standalone banger.
If you are not familiar with concurrency and related foundational concepts the cohort recordings would cover that too.
This is not a promotion I swear haha.
📣 اول ستريم من "Rust للغلابة" يوم الخميس ١٣/٦ الساعة ٧م بتوقيت لندن (٩م بتوقيت القاهرة) ان شاء الله. هابتدي بالاساسيات لكن بطريقة مختلفة عن التقليدي. السلسلة معمولة للناس اللي عندها معرفة بلغات تانيه و هاتكون خليط بين شرح Rust و كود عملي و شرح لأساسيات ال systems programming.
https://t.co/1NbE23lbxZ
GitQL 0.21.0 now support single and multi-dimensions array and index expressions inspired by PostgreSQL
Github: https://t.co/nDVSL7NrPa
#gitql#rust#git
العالم محتاج ناس اكتر تتعلم Rust بصراحة. بفكر ابتدي سلسله كلها Rust بس ماتبقاش دروس ممله، تبقى مشاريع صغيره نكتبها live. ولا نعمل كورس تعلم Rust بدون معلم الاول؟
📢New: Part 2 (of 3) What We Learned from a Year of Building with LLMs
https://t.co/RE0OCdSCCy
If you liked Part 1, Part 2 is a banger. We answer the following:
Some of my favorite takes:
AI Engineering Is NOT All You Need
Look At Your Data
We found that most people are not looking at their data as frequently as they should. You have to do this if you are serious about improving your AI
Evals Are About Iterating Quickly
Some people think that Evals are an academic exercise. But it is not! Evals dramatically increase the speed at which you can ship.
You can read the full article here: https://t.co/RE0OCdSCCy
You can follow the authors of this paper to hear additional commentary and hear about upcoming Part 3: @eugeneyan@BEBischof@charles_irl@sh_reya@jxnlco@HamelHusain
So I've been doing consulting for a couple of rad companies and the playbook usually is the same thing.
The first thing you do is you joined the team and you'd realize the engineers aren't really being quantitative with how they're describing problems. They're using words like "it looks like", "it feels like", "it looks a lot better" without actually saying "this metric improved by that much". And so the first problem we addressed is just understanding metrics, right?
Like people want to think about like the LLM or the judge mattress, but there are actually just positional recall metrics people need to understand and really internalize.
And there are techniques to improve that and design and synthetically generate data sets to help you evaluate that ahead of going into production.
Once it goes into production, the biggest issue is people lack the ability to do good observability. You maybe have:
- Questions being asked
- Thumbs up, thumbs down (maybe the copy for those buttons are not as good as they should be)
- Not using full text search in conjunction with embedding search
- Not using coherent rankers, even though we can afford it, there's no latency budget
And then we can also answer questions about what goes into context and whatnot, but ahead of all of that, to go back to the first point, we really need to nail down precision and recall and understanding and internalizing why that matters.
Once the observability is in place, now you can do things like figure out:
When I ask a question, it is returning like 10-20 documents:
- What was the cosine similarity of the text chunks?
- What were the cohere ranker scores for those text chunks?
- What are the averages?
- Do those average values predict whether or not it's a thumbs up and thumbs down?
- If it doesn't, is the thumbs up/down metric the one I really care about right?
You know the difference between "how did we do" versus "do we answer your question" could be huge, right? Because maybe in the first one, how did we do, we did answer the question, but it was a bit slow, it was a bit verbose. Is that really what we're trying to capture with the thumbs up thumbs down button? Unsure.
There's also like tons of novelty effects and those tend to wear off and you really should be reading results maybe like weeks after launch. One of the biggest things you can do is just looking at the questions ordered by the cosine similarity that we ranked their score and just finding questions you're doing poorly on - not doing poorly in the sense that we know we didn't answer the question, but even the retrievers themselves don't feel good about the system. We should have learned something about that.
I go back to number one around synthetically generating data for evaluation. What I've found was:
- If we did like Paul Graham essays and we took Paul Graham essays, took the chunks and said, "Okay AI, give me a question to this chunk that answer", and then we would test whether or not that chunk returned was returned in the document. It was like 96%. It was like really good.
- When we did the same task on GitHub issues - same prompt, same task, same evaluator - got like 65%.
Different data sets have different difficulties. And if we can't even nail synthetic questions and we understand that the question is going to be really challenging.
Okay, so now you have a list of questions with their performances. So what do you do with that?
Now that you have this list of questions, you can do things like topic modeling. You can go figure out, are there certain domains that we're, that our users are asking and does it make sense to even model it that way? I usually find that it's helpful to model it in two specific ways:
1. Topics - These are like the way the words are, or these are like the embeddings that we might be searching. This is the content of the data.
2. Capabilities - More like:
- "Oh actually, if you do the questions are just asking for the document. They're not asking, they're not trying to actually get an answer about a document that just where's the file for X."
- Some things might be time-related - "Show me the latest X, show me the most recent Y." Maybe that's objective.
- Maybe it's around the metadata, right? Maybe it is "Show me the last document modified by Jason." That has nothing to do with embedding. You need to have some kind of modified date filter. You need to be able to do some kind of keyword matching on the created by name. Those are capabilities and those are things your RAG system can't do unless you go code that into the problem.
But ultimately being able to cluster these, find those groups, find those topics, find those capabilities, you can then do group-bys against those attributes and figure out what the counts and frequencies are of these questions, and also what is the average score of these systems by doing that?
We can figure out what to prioritize. Okay, maybe it's the case that document search is 60% of the question volume - this is sick, we don't need to do RAG, then we can just do search. And the thumbs up score is like 90% and the cosine distances are like close to zero. So great, this is this is a pretty solved.
It's probably gonna happen is:
Latest documents created by Jason, Paula has terrible results, right? And not only do they have terrible results, that's a problem a language model can't solve with embeddings. And maybe that's only 2% of the questions. Okay, it's important, we have to implement this, but it's too low stakes to really solve this.
Maybe instead of trying to kill yourself over solving those 2% of the distribution, maybe you just say, "Okay, put it on the prompt - we cannot ask questions about who made documents." Just by saying no, you've improved your overall success by 2%, because if you just said no to the questions you can't handle. Maybe it's something worth building in the future, but for 2%, that's probably something else.
Maybe something else around date ranges is more important. Maybe that's 20% of the questions and we can't do any of that. And so now not only do you have a system that can do a lot of ability, you can use that observability, figure out the questions, the scores, and the performance to find groups and segments.
And by segmenting those groups, you can then find the count and the quality, and then prioritize based on the volume of what your production, if it looks like, rather than just making stuff up or hoping that you could build a system that's a hundred percent accurate when that will never be the case.
Once you have all of that in place, now you have a handful of different search tools:
- Maybe you have the ability to search documents
- Maybe you have the ability to do RAG using date ranges and modified timestamps and whatnot
Now you have many sort of systems. And I honestly recommend just having very specific tools, currently just recognizing that as these tools improve. And once these tools are all documented and well understood independently, then you can like, wire things up as an agent and just have a better system that has less UI complexity.
But even that, if you think about something like Google:
- Day one, they have websites
- Day two, they'd realize that the only way to find a restaurant is if it showed up at a review. And so maybe they build out maps, right?
- Maybe the only way to find a photo was if it was in a website and they go, "Oh, maybe we should have photos."
- And then, "Okay, video is a great data set. Let's go buy a video asset" and they buy YouTube.
And it all comes down to specializing in these specific verticals - websites, maps, videos, photos, shopping, books, whatever it is. They are prioritizing and making an investment in each of these segments, rather than trying to do everything all at once by just claiming that like embeddings can solve all of that.
I think that's the big issue that many companies are running into now where they just believe that. But honestly, like I was fine tuning like multimodal embedding models in 2016. And I did that because I didn't understand like how to deploy like OpenSearch or Elasticsearch.
Using embeddings and using faiss to do search is approximately an intern project. And real systems need to be much more complex and you have to prioritize and figure out how to spend the time and the money to make these systems better.
And embeddings are not all that you need