System Initiative is the future of DevOps Automation. It's ready to replace Infrastructure as Code today. https://t.co/ZYPDSVNZpk - sign up at https://t.co/kmHqYAb1Ft!
@spacegho_st Yeah, true. Your use case seems to split up the file and process it in parallel, which implies you are copying parts of the file to other machines. If you controlled the producer you could even *generate* each machine's part locally and avoid copy cost.
@spacegho_st Parallel processing is harder if records split across chunks. Ensuring a clean split during JSON generation seems easiest. Surgically scanning for record boundaries when splitting could work too. Could also overlap chunks if records have max size.
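A minimal sketch of the "scan for record boundaries when splitting" idea. It assumes the records are newline-delimited (one JSON record per line, no raw newlines inside a record), which the thread does not state; `split_on_record_boundaries` is a hypothetical helper name.

```python
def split_on_record_boundaries(data: bytes, n_chunks: int) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges covering `data`, each ending right after a newline.

    Assumption: one record per line, so a newline is always a record boundary.
    """
    size = len(data)
    bounds = [0]
    for i in range(1, n_chunks):
        # Start from an even split point, but never before the previous cut.
        tentative = max(size * i // n_chunks, bounds[-1])
        # Move forward to the end of the record the split point landed in.
        newline = data.find(b"\n", tentative)
        bounds.append(size if newline == -1 else newline + 1)
    bounds.append(size)
    # Drop empty ranges (possible when there are fewer records than chunks).
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if start < end]


data = b'{"a":1}\n{"a":2}\n{"a":3}\n{"a":4}\n'
ranges = split_on_record_boundaries(data, 3)
chunks = [data[start:end] for start, end in ranges]
```

Each chunk can then go to a separate worker, since no record straddles two chunks. The chunks are not perfectly even: every cut is pushed forward to the next boundary.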
@spacegho_st We called it "on demand" in simdjson. It doesn't use a schema explicitly, but takes advantage of the fact that most code that consumes JSON already encodes the schema (loop this array, read string with key "name" and int "age" ...).
@spacegho_st I suspect simdjson will beat most custom jobs up until the point where they are so big they need streaming parsing (which is your case). Parallel parsing is also an interesting case, and I've got thoughts on that but haven't had time to write them down in C++ :)
@justinschiebs Is it permissible to stub someone's toe as part of an art piece that inspires contentment, leading many people not to stub others' toes?
@justinschiebs @GhostCoase Do you mean it might always be the less moral option, or that it depends? Your original question reads like "is it *ever* ok to prioritize aesthetic good over harm?" And if it depends, the answer to that is yes.
@PhilAndrew61181 @lemire Yep, if you access the keys in a different order than they appear in the JSON, it will circle round. There isn't an explicit stack: outerObj and b are variables that keep track of their own start positions.
@spacegho_st @lemire It's clever since it is still valid JSON. But the pretty printing step has to be serial, no? Are you going to process the pretty printed JSON more than once? Otherwise I'm not sure how you gain much ...
@spacegho_st @lemire Yep, your trick is essentially to store the depth of each value using the number of tabs before it. This particularly allows you to parallelize the parse since you can start anywhere and know what you are looking at (assuming your schema is simple enough).
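A rough sketch of why tab-encoded depth lets a worker start anywhere. It assumes the JSON was pretty printed with exactly one tab per nesting level and one value per line; `depth_at` is a hypothetical helper, not something from the thread.

```python
import json


def depth_at(text: str, offset: int) -> tuple[int, int]:
    """From an arbitrary offset, return (start of the next line, its nesting depth).

    Assumption: one leading tab per nesting level, so depth == number of leading tabs.
    """
    newline = text.find("\n", offset)
    if newline == -1:
        raise ValueError("no complete line after offset")
    start = newline + 1
    depth = 0
    while start + depth < len(text) and text[start + depth] == "\t":
        depth += 1
    return start, depth


# Pretty print with tabs, then jump into the middle of the text.
text = json.dumps({"a": {"b": [1, 2]}, "c": 3}, indent="\t")
line_start, depth = depth_at(text, len(text) // 2)
```

A worker dropped at a random offset only has to skip to the next line start and count tabs to know how deep it is, without seeing anything before that point.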
@PhilAndrew61181 @lemire No, since you just finished reading the email, it starts looking for the "password" key from the comma (e.g.
{"email": "...", "password": "..."})
@the_yamiteru @lemire When you use ondemand you get a cursor representing an object or array; jsonObj["email"] puts the cursor at the "," after the email, jsonObj["password"] scans from that point to find the password and puts the cursor at the "}" or "," after the password.
@PhilAndrew61181 @lemire It's not designed around a skeleton. In ondemand, myobject["key"] scans through the JSON (counting braces) until it finds "key" in the current JSON object, skips (and validates) the ":", and puts the parser at the start of the value so you can work with it.
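A toy model of that forward scan, in Python rather than C++. This is not simdjson's code or API: `find_field`, `skip_value` and friends are hypothetical names, the input is assumed to be valid JSON, keys are compared as raw text without unescaping, and there is no wrap-around when a key is requested out of order.

```python
WHITESPACE = " \t\r\n"


def skip_ws(doc: str, i: int) -> int:
    while doc[i] in WHITESPACE:
        i += 1
    return i


def skip_string(doc: str, i: int) -> int:
    """doc[i] is an opening quote; return the index just past the closing quote."""
    i += 1
    while doc[i] != '"':
        i += 2 if doc[i] == "\\" else 1  # step over escaped characters
    return i + 1


def skip_value(doc: str, i: int) -> int:
    """Return the index just past the value that starts at doc[i]."""
    if doc[i] == '"':
        return skip_string(doc, i)
    if doc[i] in "{[":
        depth = 0
        while True:
            c = doc[i]
            if c == '"':
                i = skip_string(doc, i)  # braces inside strings don't count
                continue
            if c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
    while doc[i] not in ",}]" + WHITESPACE:  # number, true, false, null
        i += 1
    return i


def find_field(doc: str, pos: int, key: str) -> int:
    """Scan forward from `pos` (just after '{', or on a ',') to the value of `key`.

    Returns the index where the value starts; raises KeyError at the closing '}'.
    """
    i = skip_ws(doc, pos)
    while True:
        if doc[i] == ",":
            i = skip_ws(doc, i + 1)
        if doc[i] == "}":
            raise KeyError(key)
        key_end = skip_string(doc, i)
        found = doc[i + 1 : key_end - 1] == key
        i = skip_ws(doc, key_end)
        if doc[i] != ":":
            raise ValueError("expected ':' after key")
        i = skip_ws(doc, i + 1)
        if found:
            return i
        i = skip_ws(doc, skip_value(doc, i))  # not our key: skip its value


doc = '{"email": "a@b.c", "meta": {"x": [1, "}"]}, "password": "pw"}'
email_start = find_field(doc, 1, "email")
cursor = skip_value(doc, email_start)  # now on the "," after the email
password_start = find_field(doc, cursor, "password")
```

Looking up "password" resumes from where "email" left off, so the document is walked once as long as keys are requested in document order.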
@clintonmead @lemire @etorreborre On Demand is a streaming dom, and will check the closing brace when you get to the end of the array or object you are reading. SIMD isn't really the trick here: On Demand reads whole documents faster than simdjson's SIMD DOM parser, even using the same SIMD code.
@PhilAndrew61181 @lemire In most cases dom is slower, because you have to iterate the whole document a second time (in its dom form) when you actually use it. dom can get faster when you *need* to iterate the document multiple times in a generic form.
@the_yamiteru @lemire To use a typical one-shot JSON parser, you have to iterate the document twice: first to create a generic object that can hold any JSON type, and second to retrieve "email" and "password" keys from that generic structure. On Demand lets you do it once, strictly faster.
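The two-step pattern, sketched with Python's standard `json` module standing in for "a typical one-shot JSON parser" (an assumption; the thread doesn't name one). The sample document is made up for illustration.

```python
import json

doc = '{"email": "a@example.com", "password": "hunter2", "extra": [1, 2, 3]}'

# Step 1: walk the whole text and build a generic structure (dicts, lists,
# strings, numbers) that can hold any JSON value, including "extra",
# which this code never reads.
generic = json.loads(doc)

# Step 2: go through that generic structure again to pull out the two
# fields the program actually wanted.
email = generic["email"]
password = generic["password"]
```

The work in step 1 is spent on every value in the document, whether or not step 2 ever touches it.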