John Keiser

@jkeiser2

Mad scientist / software engineer. In spare time, I design languages and unicycle with my kids. I helped start @CampQuestNW.

Seattle, WA

Joined July 2009

244 Following

656 Followers

1.6K Posts

jkeiser2 retweeted

Adam Jacob

@adamhjk

over 1 year ago

System Initiative is the future of DevOps Automation. It's ready to replace Infrastructure as Code today. https://t.co/ZYPDSVNZpk - sign up at https://t.co/kmHqYAb1Ft!

John Keiser @jkeiser2

about 2 years ago

@spacegho_st Yeah, true. Your use case seems to split up the file and process it in parallel, which implies you are copying parts of the file to other machines. If you controlled the producer you could even *generate* each machine's part locally and avoid copy cost.

John Keiser @jkeiser2

about 2 years ago

@spacegho_st Parallel processing is harder if records split across chunks. Ensuring a clean split during JSON generation seems easiest. Surgically scanning for record boundaries when splitting could work too. Could also overlap chunks if records have max size.
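One way to picture the "scan for record boundaries when splitting" option is the sketch below. It assumes the records are newline-delimited (one JSON record per line), which the thread does not state; the function name `split_on_record_boundaries` and the sample data are hypothetical. Each cut is pushed forward to the next newline so no record straddles two chunks, which means the result can contain fewer chunks than requested.

```python
def split_on_record_boundaries(data: bytes, num_chunks: int) -> list[bytes]:
    """Split newline-delimited records into chunks without cutting a record.

    Assumption: every record ends with b"\n". Valid JSON strings cannot
    contain a raw newline byte, so a newline is always a record boundary.
    """
    chunks = []
    start = 0
    target = max(1, len(data) // num_chunks)  # rough chunk size in bytes
    while start < len(data):
        end = min(start + target, len(data))
        if end < len(data):
            # Push the cut forward to just past the next newline.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        chunks.append(data[start:end])
        start = end
    return chunks


# Hypothetical sample input: four small records.
data = b'{"a":1}\n{"a":2}\n{"a":3}\n{"a":4}\n'
chunks = split_on_record_boundaries(data, 3)
```

Each chunk can then be handed to a separate worker, since every chunk starts and ends on a record boundary.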

John Keiser @jkeiser2

about 2 years ago

@spacegho_st We called it "on demand" in simdjson. It doesn't use schema explicitly, but takes advantage of the fact that most code that consumes JSON already encodes the schema (loop this array, read string with key "name" and int "age" ...).

John Keiser @jkeiser2

about 2 years ago

@spacegho_st I suspect simdjson will beat most custom jobs up until the point where they are so big they need streaming parsing (which is your case). Parallel parsing is also an interesting case, and I've got thoughts on that but haven't had time to write them down in C++ :)

John Keiser @jkeiser2

about 2 years ago

@spacegho_st What sort of duel?

John Keiser @jkeiser2

about 2 years ago

@justinschiebs Is it permissible to stub someone's toe as part of an art piece that inspires contentment, leading many people not to stub others' toes?

John Keiser @jkeiser2

about 2 years ago

@justinschiebs @GhostCoase Do you mean it might always be the less moral option, or that it depends? Your original question reads like "is it *ever* ok to prioritize aesthetic good over harm?" And if it depends, the answer to that is yes.

John Keiser @jkeiser2

about 2 years ago

@PhilAndrew61181 @lemire Yep, if you access the keys in a different order than the JSON it will circle round. There isn't an explicit stack: outerObj and b are variables that keep track of their own start positions.

John Keiser @jkeiser2

about 2 years ago

@spacegho_st @lemire It's clever since it is still valid JSON. But the pretty printing step has to be serial, no? Are you going to process the pretty printed JSON more than once? Otherwise I'm not sure how you gain much ...

John Keiser @jkeiser2

about 2 years ago

@spacegho_st @lemire Yep, your trick is essentially to store the depth of each value using the number of tabs before it. This particularly allows you to parallelize the parse since you can start anywhere and know what you are looking at (assuming your schema is simple enough).
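A rough illustration of the tab-depth idea, assuming the JSON was pretty-printed with exactly one tab per nesting level (here produced with Python's `json.dumps(..., indent="\t")`). The helper name `depth_of_lines` and the sample document are made up for the example; this is not the code from the thread.

```python
import json


def depth_of_lines(pretty: str) -> list[tuple[int, str]]:
    """Return (depth, content) per line, where depth = number of leading tabs."""
    result = []
    for line in pretty.splitlines():
        content = line.lstrip("\t")
        result.append((len(line) - len(content), content))
    return result


# Hypothetical sample document, pretty-printed with one tab per level.
doc = {"user": {"name": "a", "tags": ["x", "y"]}}
pretty = json.dumps(doc, indent="\t")
lines = depth_of_lines(pretty)
```

Because the depth is recoverable from a single line, a worker can start at any line and know how deeply nested the value in front of it is without reading what came before.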

John Keiser @jkeiser2

about 2 years ago

@PhilAndrew61181 @lemire No, since you just finished reading the email, it starts looking for the "password" key starting from the comma (e.g., {"email": "...", "password": "..."})

John Keiser @jkeiser2

about 2 years ago

@the_yamiteru @lemire When you use ondemand you get a cursor representing an object or array; jsonObj["email"] puts the cursor at the "," after the email, jsonObj["password"] scans from that point to find the password and puts the cursor at the "}" or "," after the password.

John Keiser @jkeiser2

about 2 years ago

@PhilAndrew61181 @lemire It's not designed around a skeleton. ondemand myobject["key"] scans through the JSON (counting braces) until it finds "key" in the current JSON object, skips (and validates) the ":", and puts the parser at the start of the value so you can work with it.
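The forward scan described here can be mimicked with a toy sketch. This is not simdjson's implementation or API: the function names are hypothetical, the input is assumed to be valid JSON, and keys are compared as raw text without unescaping. It only shows the shape of the idea: walk forward from the current position, track nesting depth, and stop at the value that follows the matching key in the current object.

```python
def skip_string(doc: str, i: int) -> int:
    """Given the index of an opening quote, return the index just past the closing quote."""
    i += 1
    while doc[i] != '"':
        i += 2 if doc[i] == "\\" else 1  # step over escaped characters
    return i + 1


def skip_ws(doc: str, i: int) -> int:
    """Advance past JSON whitespace."""
    while i < len(doc) and doc[i] in " \t\r\n":
        i += 1
    return i


def find_value_start(doc: str, pos: int, key: str) -> int:
    """Scan forward from pos (inside the current object) for key.

    Returns the index where the key's value begins, or -1 if the current
    object's closing brace is reached first.
    """
    depth = 0  # nesting depth relative to the current object
    i = pos
    while i < len(doc):
        c = doc[i]
        if c == '"':
            end = skip_string(doc, i)
            after = skip_ws(doc, end)
            # A string followed by ':' is a key; only match keys at our own depth.
            is_key = after < len(doc) and doc[after] == ":"
            if depth == 0 and is_key and doc[i + 1:end - 1] == key:
                return skip_ws(doc, after + 1)
            i = end
        elif c in "{[":
            depth += 1
            i += 1
        elif c in "}]":
            if depth == 0:
                return -1  # end of the current object, key not found
            depth -= 1
            i += 1
        else:
            i += 1
    return -1


# Hypothetical sample document.
doc = '{"email": "a@b.c", "meta": {"password": "nested"}, "password": "hunter2"}'
email_at = find_value_start(doc, 1, "email")
password_at = find_value_start(doc, email_at, "password")
```

Looking up "password" starts from where the "email" lookup left off rather than from the top, and the nested "password" inside "meta" is skipped because it sits at a deeper level.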

John Keiser @jkeiser2

about 2 years ago

@clintonmead @lemire @etorreborre On Demand is a streaming DOM, and will check the closing brace when you get to the end of the array or object you are reading. SIMD isn't really the trick here: On Demand reads whole documents faster than simdjson's SIMD DOM parser, even using the same SIMD code.

John Keiser @jkeiser2

about 2 years ago

@PhilAndrew61181 @lemire In most cases DOM is slower, because you have to iterate the whole document a second time (in its DOM form) when you actually use it. DOM can get faster when you *need* to iterate the document multiple times in a generic form.

John Keiser @jkeiser2

about 2 years ago

@the_yamiteru @lemire To use a typical one-shot JSON parser, you have to iterate the document twice: first to create a generic object that can hold any JSON type, and second to retrieve "email" and "password" keys from that generic structure. On Demand lets you do it once, strictly faster.

John Keiser @jkeiser2

about 2 years ago

@adamhjk I'm really looking forward to seeing infrastructure management that actually cares about its user interface!
