It needs some sort of weighting, e.g. prioritizing actual elements over the text in it. Otherwise it's ok
Oh and the backend is just a bunch of JSON files lmao
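For the weighting idea, here's a rough sketch of one way it could look. It assumes each hit comes with two separate similarity scores, one for the elements and one for the text, which is not how the current setup is described. `rerank`, the tuple layout and the 0.7 / 0.3 weights are all made up for illustration.

```python
def rerank(hits, w_elements=0.7, w_text=0.3):
    """Re-order hits so element matches count more than text matches.

    hits: list of (frame_id, element_sim, text_sim) tuples.
    Hypothetical layout: assumes two similarity scores exist per frame.
    Returns frame ids, best first.
    """
    scored = [(w_elements * e + w_text * t, frame_id) for frame_id, e, t in hits]
    scored.sort(reverse=True)  # highest combined score first
    return [frame_id for _, frame_id in scored]
```

Bumping `w_text` above `w_elements` flips the priority, so the weights would need tuning against real queries.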
How it was made:
Exported frames from the clips (2 frames per sec)
Gave them to Gemini to get a description of what is going on in each one and their embeddings
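Roughly, the indexing step could be sketched like this. `describe` and `embed` are hypothetical stand-ins for the Gemini calls (not real API names), and the sketch assumes the embedding is computed from the description text, frames come from a single clip, and the file names sort in time order. The one-JSON-file-per-frame layout is also just a guess.

```python
import json
from pathlib import Path

FPS = 2  # frames exported per second


def build_index(frame_paths, describe, embed, out_dir):
    """Write one JSON record per frame and return how many were written.

    describe(path) -> str and embed(text) -> list[float] are hypothetical
    placeholders for whatever Gemini calls are actually used.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    # Assumes names like clip_0001.png, so sorting gives time order.
    for i, frame in enumerate(sorted(frame_paths)):
        description = describe(frame)
        record = {
            "frame": str(frame),
            "timestamp_sec": i / FPS,  # frame i sits at i / 2 seconds
            "description": description,
            "embedding": embed(description),
        }
        (out / f"{Path(frame).stem}.json").write_text(json.dumps(record))
        count += 1
    return count
```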
Search:
It uses the same Gemini model to embed the query and then just searches the stored vectors for the closest ones
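A minimal sketch of what that search could look like over a folder of JSON files. `embed` is again a hypothetical stand-in for the Gemini embedding call, the `"embedding"` key is an assumed record layout, and cosine similarity is an assumption too since the metric isn't stated above.

```python
import json
import math
from pathlib import Path


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, embed, index_dir, top_k=5):
    """Embed the query, then brute-force compare it with every stored vector.

    Returns up to top_k (score, record) pairs, best match first.
    """
    q = embed(query)  # hypothetical placeholder for the Gemini call
    scored = []
    for path in Path(index_dir).glob("*.json"):
        record = json.loads(path.read_text())
        scored.append((cosine(q, record["embedding"]), record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Brute force like this reads every file on every query, which is fine for a small pile of frames but would get slow as the number of clips grows.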