I want to train a "small" model.
Or at least gather the practical knowledge of what it takes by attempting it.
I've been so inspired reading about @vikhyatk's moondream and fpgaminer's joytag. Individuals showing what can be accomplished is exciting.
I'll probably regret picking this, but video as input caught my attention.
I work with a few golf brands in my day job, and the industry has always fascinated me.
> $7.5B in '22
> CAGR of 5.78%
> growing 18-34yo category
It's also just such an odd sport to be so popular.
I think the layers of contradiction add to the appeal. The WM Phoenix Open is a good example of this.
Anyway, swing analysis.
Set up your phone, take a video of your swing, and get feedback on technique and an improvement strategy.
Half of this process is trivial with GPT4V. Pull and annotate the correct frames and get a good report output with some prompting.
Trivial, that is, until trying to get the full pipeline to run on device.
But the other half is the interesting part!
How do you go from frame sequence to correctly selected frames with annotation?
The idea isn't novel, and there are probably complete products available, but I think I've wrapped my head around the high-level requirements, and it offers interesting challenges, including...
Action Detection/Event Spotting
We probably need around 8 event frames from the original video + a no-event to recreate the MVP:
1. Setup: a frame before the backswing begins
2. Start-Backswing: club shaft parallel to the ground
3. Mid-Backswing: arms parallel to the ground
4. Top: change in direction from backswing to downswing
5. Mid-Downswing: arms parallel to the ground
6. Impact: club head touches the ball
7. Mid-Follow-Through: club shaft parallel to the ground
8. Finish: a frame before the final pose is relaxed
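To make the labeling concrete, here's a rough sketch of how those 8 events plus a no-event class could be turned into per-frame labels. The names (`SwingEvent`, `frame_labels`) and the integer ids are my own placeholders, not anything taken from GolfDB:

```python
from enum import IntEnum


class SwingEvent(IntEnum):
    # Hypothetical class ids: the 8 events listed above plus a no-event class.
    SETUP = 0
    START_BACKSWING = 1
    MID_BACKSWING = 2
    TOP = 3
    MID_DOWNSWING = 4
    IMPACT = 5
    MID_FOLLOW_THROUGH = 6
    FINISH = 7
    NO_EVENT = 8


def frame_labels(num_frames, event_frames):
    """Build one label per frame from a {SwingEvent: frame_index} mapping.

    Every frame that is not an annotated event gets NO_EVENT.
    """
    labels = [SwingEvent.NO_EVENT] * num_frames
    for event, index in event_frames.items():
        if not 0 <= index < num_frames:
            raise ValueError(f"{event.name} index {index} is outside the video")
        labels[index] = event
    return labels


# Usage: a 10-frame toy clip with two annotated events.
labels = frame_labels(10, {SwingEvent.SETUP: 1, SwingEvent.IMPACT: 6})
```

Almost every frame ends up as no-event, so class imbalance is probably something to plan for early.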
Detecting any of these events using a single frame seems difficult.
> correct setup frame or just a practice swing?
> mid-downswing or mid-backswing?
Time context will need to be considered.
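One simple way to picture "time context" is to give each frame a window of its neighbors instead of looking at it alone. A minimal sketch, assuming a fixed window radius and edge frames that just repeat (both are my assumptions, and `context_windows` is a made-up helper):

```python
def context_windows(num_frames, radius):
    """Return, for each frame, the indices of the frames within `radius` of it.

    Indices that would fall outside the clip are clamped to the first or
    last frame, so every window has the same length (2 * radius + 1).
    """
    windows = []
    for t in range(num_frames):
        window = [
            min(max(i, 0), num_frames - 1)
            for i in range(t - radius, t + radius + 1)
        ]
        windows.append(window)
    return windows


# Usage: 5 frames, each seen together with one neighbor on either side.
windows = context_windows(5, 1)
```

With neighbors in view, mid-backswing and mid-downswing stop looking identical, because the direction of motion differs across the window.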
How would applying attention change the scope?
Does attention make single-frame possible?
How would attention mechanisms impact evals over CNN + LSTM?
Intuition says a transformer-based architecture could achieve a deeper and more nuanced understanding of the swing dynamics.
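To compare architectures at all, I'd need a metric that doesn't punish being one frame off. A sketch of a tolerance-based score, where the tolerance value, the event names, and the function name are all my assumptions rather than a fixed benchmark definition:

```python
def events_within_tolerance(predicted, actual, tolerance):
    """Fraction of events whose predicted frame is within `tolerance`
    frames of the labeled frame.

    `predicted` and `actual` map event name -> frame index.
    """
    if not actual:
        raise ValueError("need at least one labeled event")
    if predicted.keys() != actual.keys():
        raise ValueError("predicted and actual must cover the same events")
    hits = sum(
        1 for event in actual
        if abs(predicted[event] - actual[event]) <= tolerance
    )
    return hits / len(actual)


# Usage: "top" is 1 frame off (a hit), "impact" is 8 frames off (a miss).
score = events_within_tolerance(
    {"top": 40, "impact": 52},
    {"top": 41, "impact": 60},
    tolerance=2,
)
```

The same function could score a CNN + LSTM baseline and an attention-based model on identical clips, which is the comparison I actually care about.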
But what the fuck do I know? Little.
I want to find out.
Writing this down helps me get out of my head, so I hope to continue.
Next steps:
> get a better understanding of current related research
> use the GolfDB as initial dataset for testing - better understand constraints
> narrow architecture considerations
> listen to smarter people to increase current smartness
A lot of this came from a 2019 paper, "GolfDB: A Video Database for Golf Swing Sequencing" arXiv:1903.06528
If you're reading this (wtf are you doing) and have thoughts, insight, direction, papers to read, reasons this is a waste of time, anything... I would greatly appreciate it.
<3
Graph of decision quality among professional Go players. A sudden increase after AlphaGo. It is not only because they are learning from the AI. Players are suddenly inventing new moves at a faster rate too!
I'm (relatively) new to everything required to do this...
But I built the thing anyway:
✏️ Homework guides for parents from an image.
https://t.co/aXyIO2ivxW
Do you think this is helpful?
🛠️ @langchain @hwchase17
https://t.co/w4lm3k0J3L
#buildinpublic
Looking for feedback!
I'm new to all of this.
- product development
- frontend
- backend
- artificial intelligence
I started to code a year ago and have been learning typescript over the past 6 months.
https://t.co/aXyIO2ivxW
Free to try!
Let me know what you think.
4. Healthcare isn't "lindy"
Private health insurance was popularized in the mid 1950s, and Medicare/Medicaid were created in the 1960s.
Almost all of the modern healthcare system is younger than the executives that run it. We shouldn't expect it to exist unchanged for the next century.
Anyone else have numbered list outputs in gpt-4 chat history threads on chatgpt revert to ascending order in the last 48 hours when they were originally received in descending order?
I only noticed because it was odd seeing the originals start at 9 and end at 0.
Responses aren't saved as strings?
Or just a gpt-markdown change?
Weird db thing?
I want to understand!
@F_Sammarco All kinds of things, but the most common are internal or customer facing chat interfaces, analyzing text data (e.g. product reviews), and translation.
AI Twitter is flooded with low-quality stuff recently. No, GPT is not “dethroned”. And thin wrapper apps are not “insane”. At all.
I feel obligated to surface some quality posts I bookmarked. Every one of them should've been promoted 10x, but ¯\_(ツ)_/¯
In no particular order:
In case you missed it from yesterday... our "Agents in Production" webinar is now on YouTube!
Big thanks to @sjwhitmore @DivGarg9 @devstein64 for joining!
https://t.co/5g2IwUlstC
@sama
Have you looked into what % of your codebase has been produced by a language model in the last year?
sorry if you've answered this, i'm lazy and hopeful?