Ed Parsons

@edparso

I'm a guy who likes to build stuff!

San Francisco, California

Joined October 2008

195 Following

106 Followers

165 Posts

Ed Parsons @edparso

2 months ago

Some people just want to see the tokens burn

Ed Parsons @edparso

5 months ago

In the prisoner's dilemma, despite the fact that both players cooperating is Pareto efficient, the only pure Nash equilibrium is when both players choose to defect.

Ed Parsons @edparso

5 months ago

#TIL #6 In addition to the well-known Prisoner's Dilemma, a similar game theory concept is the Stag Hunt. The stag hunt differs from the prisoner's dilemma in that there are two pure-strategy Nash equilibria: one where both players cooperate, and one where both players defect.
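
The equilibrium claims in the two posts above can be checked by brute force. The sketch below is only an illustration: the payoff numbers are assumed textbook-style values, not taken from the posts, and `pure_nash_equilibria` is a hypothetical helper name. It tests every cell of a 2x2 game and keeps the cells where neither player gains by deviating alone.

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (row_action, col_action) -> (row_payoff, col_payoff).
    """
    row_actions = sorted({r for r, _ in payoffs})
    col_actions = sorted({c for _, c in payoffs})
    equilibria = []
    for r, c in product(row_actions, col_actions):
        row_payoff, col_payoff = payoffs[(r, c)]
        # A cell is an equilibrium if no unilateral deviation pays more.
        row_ok = all(payoffs[(alt, c)][0] <= row_payoff for alt in row_actions)
        col_ok = all(payoffs[(r, alt)][1] <= col_payoff for alt in col_actions)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

# Assumed illustrative payoffs. C = cooperate, D = defect.
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
STAG_HUNT = {
    ("C", "C"): (4, 4), ("C", "D"): (0, 3),
    ("D", "C"): (3, 0), ("D", "D"): (3, 3),
}

print(pure_nash_equilibria(PRISONERS_DILEMMA))  # [('D', 'D')]
print(pure_nash_equilibria(STAG_HUNT))          # [('C', 'C'), ('D', 'D')]
```

With these assumed payoffs, the prisoner's dilemma has only mutual defection as an equilibrium, while the stag hunt has both mutual cooperation and mutual defection.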

Ed Parsons @edparso

5 months ago

#TIL #5 The S&P returned 18% last year, its third double-digit year in a row. However, if you are a gold-based investor rather than a dollar-based investor, it returned -28%. Same companies, same performance, just different perspectives.
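
The arithmetic behind the two figures can be sketched as a change of unit. The post does not state how much gold rose in dollars. The code below backs that number out from the 18% and -28% figures, so it is an implied value under the assumption that both returns cover the same period, not a quoted one. The function names are hypothetical.

```python
def return_in_numeraire(asset_return, numeraire_return):
    """Return of an asset measured in units of another asset (the numeraire)."""
    return (1 + asset_return) / (1 + numeraire_return) - 1

def implied_numeraire_return(asset_return, asset_return_in_numeraire):
    """Numeraire's own return implied by the asset's return in both units."""
    return (1 + asset_return) / (1 + asset_return_in_numeraire) - 1

sp_in_dollars = 0.18   # from the post
sp_in_gold = -0.28     # from the post

# Implied rise of gold in dollar terms (derived, not stated in the post).
gold_in_dollars = implied_numeraire_return(sp_in_dollars, sp_in_gold)
print(f"{gold_in_dollars:.1%}")  # 63.9%

# Round trip: an 18% dollar gain measured against that gold move is -28%.
print(f"{return_in_numeraire(sp_in_dollars, gold_in_dollars):.1%}")  # -28.0%
```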


Ed Parsons @edparso

5 months ago

#TIL #4 Parasocial relationships are being developed with AI. A study of 3,532 people shows relationship-seeking AI quickly feels good, then less so, while “wanting” keeps growing. Users form attachments, seek more AI companionship, and see it as a friend, not a tool, yet gain no psychosocial benefit.

Ed Parsons @edparso

5 months ago

@businessbarista Claude

Ed Parsons @edparso

5 months ago

#TIL #3 The Zeigarnik Effect is the psychological tendency to remember unfinished or interrupted tasks better than completed ones. This explains why we get distracted by incomplete to-do lists, and also why starting a task, even briefly, increases the likelihood of finishing it.

Ed Parsons @edparso

5 months ago

Village, sleep late, fish a little, play with your kids, take siestas with your wife, sip wine, and play guitar with your friends.

Ed Parsons @edparso

5 months ago

TIL #2: the Mexican fishermen parable. An American investment banker meets a Mexican fisherman who catches a few large tuna in only a little while—just enough to support his family. Asked why he doesn’t stay out longer, the fisherman explains he sleeps late, fishes a little,

Ed Parsons @edparso

5 months ago

Control production and distribution. Move from the small village to Mexico City, then Los Angeles, then New York. In 15–20 years, announce an IPO, sell stock, and make millions. “And then what?” the fisherman asks. “Then you retire,” says the banker—move to a small coastal

Ed Parsons @edparso

6 months ago

TIL: for every 100 South Koreans today, there will be 6 grandchildren. This feels like one of the craziest birthrate stats I’ve ever heard. #1

Ed Parsons @edparso

6 months ago

@mikefutia Reels

Ed Parsons @edparso

about 2 years ago

This came out 4 days before OpenAI released 4o rather than 5. Great timing as a counter to pure AI optimism. https://t.co/wylvkN6EXF It seems it will take another transformer-level breakthrough; I’d bet on us to find it.

edparso retweeted

Gary Marcus

@GaryMarcus

about 2 years ago

GPT-4o hot take:
• The speech synthesis is terrific, reminds me of Google Duplex (which never took off).
but
• If OpenAI had GPT-5, they would have shown it.
• They don’t have GPT-5 after 14 months of trying.
• The most important figure in the blogpost is attached below. And the most important thing about the figure is that 4o is not a lot different from Turbo, which is not hugely different from 4.
• Lots of quirky errors are already being reported, same as ever. (See e.g., examples from @RosenzweigJane and @benjaminjriley.)
• OpenAI has presumably pivoted to new features precisely because they don’t know how to produce the kind of capability advance that the “exponential improvement” would have predicted.
• Most importantly, each day in which there is no GPT-5 level model, from OpenAI or any of their well-financed, well-motivated competitors, is evidence that we may have reached a phase of diminishing returns.



edparso retweeted

Billy Gallagher

@GallagherBilly

about 2 years ago

Best advice I ever got: think like a VC when looking for a tech job. It led me to joining Rippling ($11B+) before its Series A. It changed my life, so I built a free tool to help others. We expect that each company listed on Prospect will grow at least 2X over the next 4 years:



Ed Parsons @edparso

over 2 years ago

Discovered an intriguing concept - Kahneman-Tversky Optimization (KTO). It's not just about aligning LLMs more efficiently and cheaply, but also a leap towards valuing implicit feedback over explicit feedback. Really cool stuff! #KTO #AI #ImplicitFeedback https://t.co/zwZmpCFRZy

Ed Parsons @edparso

over 2 years ago

Seems this is the season to push out models before they are ready

Santiago

@svpino

over 2 years ago

Google, this is embarrassing. You published an impressive video showing Gemini answering your questions. It looked awesome. It looked real-time. But it was a lie. None of that happened as recorded and presented to the public. Instead, you cherry-picked frames and edited a video in a way you knew it would impress people. That's misleading, and anyone who participated in this charade should be embarrassed. I hope I'm wrong. Read their disclaimer: "We've been capturing footage to test it on a wide range of challenges, showing it a series of images, and asking it to reason about what it sees."



Ed Parsons @edparso

over 2 years ago

@DrJimFan I wonder if the star part of Q* also has reference to A* search. The simplest way I’ve thought of that is adding another model for the A* heuristic to speed up the exploration of the ToT. Would this be the job of the Value NN?

edparso retweeted

Greg Kamradt

@GregKamradt

over 2 years ago

Claude 2.1 (200K Tokens) - Pressure Testing Long Context Recall

We all love increasing context lengths - but what's performance like?

Anthropic reached out with early access to Claude 2.1 so I repeated the “needle in a haystack” analysis I did on GPT-4

Here's what I found:

Findings:
* At 200K tokens (nearly 470 pages), Claude 2.1 was able to recall facts at some document depths
* Facts at the very top and very bottom of the document were recalled with nearly 100% accuracy
* Facts positioned at the top of the document were recalled with less performance than the bottom (similar to GPT-4)
* Starting at ~90K tokens, performance of recall at the bottom of the document started to get increasingly worse
* Performance at low context lengths was not guaranteed

So what:
* Prompt Engineering Matters - It’s worth tinkering with your prompt and running A/B tests to measure retrieval accuracy
* No Guarantees - Your facts are not guaranteed to be retrieved. Don’t bake the assumption they will into your applications
* Less context = more accuracy - This is well known, but when possible reduce the amount of context you send to the models to increase its ability to recall
* Position Matters - Also well known, but facts placed at the very beginning and 2nd half of the document seem to be recalled better

Why run this test?:
* I’m a big fan of Anthropic! They are helping to push the bounds on LLM performance and creating powerful tools for the world
* As a practitioner of LLMs, it’s important to build an intuition for how they work, where they excel and their limits
* Tests like these, while not bulletproof, help showcase real world examples and get a feeling for how they work. The goal is to transfer this knowledge to productive use cases

Overview of the process:
* Use Paul Graham essays as ‘background’ tokens. With 218 essays it’s easy to get up to 200K tokens (repeated essays when necessary)
* Place a random statement within the document at various depths. Fact used: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”
* Ask Claude 2.1 to answer this question only using the context provided
* Evaluate Claude 2.1's answer with GPT-4 using @langchain evals
* Rinse and repeat for 35x document depths between 0% (top of document) and 100% (bottom of document) (sigmoid distribution) and 35x context lengths (1K Tokens > 200K Tokens)

Next Steps To Take This Further:
* For rigor, one should do a key:value retrieval step. However, for relatability I did a San Francisco line within PG's essays for clarity and practical relevance
* Repeat test multiple times for increased statistical significance

Notes:
* Amount Of Recall Matters - The model's performance is hypothesized to diminish when tasked with multiple fact retrievals or when engaging in synthetic reasoning steps
* Changing your prompt, question, fact to be retrieved and background context will impact performance
* The Anthropic team reached out and offered credits to repeat this test. They also offered prompt advice to maximize performance. It's important to clarify that their involvement was strictly logistical. The integrity and independence of the results were maintained, ensuring that the findings reflect my unbiased evaluation and are not influenced by their support.
* This test cost ~$1,016 for API calls ($8 per million tokens)
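
The process described in the post can be sketched roughly as below. This is not the author's code, and several things are assumptions made for illustration: lengths are counted in words rather than tokens, depths are evenly spaced rather than sigmoid-distributed, a substring check stands in for the GPT-4 evaluation, and `ask_model` is a hypothetical callable that would wrap a real model API. All function names are invented for this sketch.

```python
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")

def linspace(start, stop, count):
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

def build_context(background_words, needle, context_length, depth_percent):
    """Build `context_length` background words with the needle inserted
    at `depth_percent` (0 = top of document, 100 = bottom)."""
    if not background_words:
        raise ValueError("background_words must not be empty")
    words = []
    while len(words) < context_length:  # repeat the background when necessary
        words.extend(background_words)
    words = words[:context_length]
    position = round(len(words) * depth_percent / 100)
    return " ".join(words[:position] + [needle] + words[position:])

def recalled(answer):
    """Crude substring check, an assumed stand-in for the GPT-4 evaluation."""
    return "dolores park" in answer.lower()

def run_grid(ask_model, background_words, context_lengths, depths):
    """Return {(context_length, depth_percent): bool} for every grid cell."""
    results = {}
    for length in context_lengths:
        for depth in depths:
            context = build_context(background_words, NEEDLE, int(length), depth)
            results[(length, depth)] = recalled(ask_model(context))
    return results

# Stand-in "model" that echoes its context, so the sketch runs offline.
demo = run_grid(lambda ctx: ctx, ["essay", "text"],
                linspace(10, 50, 3), linspace(0, 100, 3))
print(len(demo), all(demo.values()))
```

A full run at the scale the post describes would use 35 lengths and 35 depths, for example `linspace(1000, 200000, 35)` and `linspace(0, 100, 35)`, giving 1,225 model calls.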


