A breakdown of Data Pipelines in Machine Learning Systems. And yes, it can also be used for LLM-based systems!
It is critical to ensure Data Quality and Integrity upstream of ML Training and Inference Pipelines. Trying to do that in the downstream systems will inevitably lead to failures when working at scale.
There is a ton of work to be done on the Data Lake or LakeHouse layer. See the example architecture below.
Example architecture for a production-grade end-to-end data flow:
1: Schema changes are implemented in version control. Once approved, they are pushed to the Applications generating the Data, the Databases holding the Data, and a central Data Contract Registry.
Applications push generated Data to Kafka Topics:
2: Events emitted directly by the Application Services.
Note: This also includes IoT Fleets and Website Activity Tracking.
2.1: Raw Data Topics for CDC streams.
3: One or more Flink Applications consume Data from the Raw Data streams and validate it against schemas in the Contract Registry.
4: Data that does not meet the contract is pushed to a Dead Letter Topic.
5: Data that meets the contract is pushed to a Validated Data Topic.
6: Data from the Validated Data Topic is pushed to object storage for additional Validation.
7: On a schedule, Data in the Object Storage is validated against additional SLAs in the Data Contracts and is pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.
8: Modeled and Curated data is pushed to the Feature Store System for further Feature Engineering.
8.1: Real Time Features are ingested into the Feature Store directly from the Validated Data Topic (5).
Note: Ensuring Data Quality here is complicated, since checks against SLAs are hard to perform.
9: High Quality Data is used in Machine Learning Training Pipelines.
10: The same Data is used for Feature Serving in Inference.
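Steps 3 to 5 above can be sketched in a few lines of plain Python. This is only an illustration of the routing idea, not a Flink job: the contract format (`ORDER_CONTRACT`), the `Topics` class and the `route` function are hypothetical names invented for this example, and the in-memory lists stand in for Kafka topics.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical contract: field name -> required Python type.
# A real Contract Registry would serve versioned schemas instead.
ORDER_CONTRACT: dict[str, type] = {
    "order_id": str,
    "amount": float,
    "created_at": str,
}


@dataclass
class Topics:
    """In-memory stand-ins for the Validated Data and Dead Letter topics."""
    validated: list[dict[str, Any]] = field(default_factory=list)
    dead_letter: list[dict[str, Any]] = field(default_factory=list)


def contract_violations(event: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return readable violations; an empty list means the event conforms."""
    problems = []
    for name, expected in contract.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def route(event: dict[str, Any], contract: dict[str, type], topics: Topics) -> None:
    """Step 4: violations go to the dead letter topic. Step 5: the rest is validated."""
    problems = contract_violations(event, contract)
    if problems:
        # Keep the errors next to the event so producers can debug it later.
        topics.dead_letter.append({"event": event, "errors": problems})
    else:
        topics.validated.append(event)


topics = Topics()
route({"order_id": "a1", "amount": 9.5, "created_at": "2024-01-01T00:00:00Z"},
      ORDER_CONTRACT, topics)
route({"order_id": "a2", "amount": "9.5"}, ORDER_CONTRACT, topics)
print(len(topics.validated), len(topics.dead_letter))
```

Storing the violation list together with the rejected event is a design choice made for this sketch. It lets the producing team see why an event was rejected without re-running validation.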
Note: ML Systems are plagued by other Data related issues like Data Drift and Concept Drift. These are silent failures and, while they can be monitored, we can't include them in the Data Contract.
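As one possible way to monitor Data Drift outside the contract, here is a minimal sketch of the Population Stability Index (PSI) for a single numeric feature. The post does not prescribe a drift metric. PSI, the `psi` function name, the bin count and the 0.2 alert threshold are all assumptions made for illustration.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range live values into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at a small value so the logarithm stays defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(100)]
serving = [v + 0.5 for v in training]  # simulated shift in the live data

ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb value; tune per feature
print(psi(training, training) > ALERT_THRESHOLD)
print(psi(training, serving) > ALERT_THRESHOLD)
```

A check like this only covers Data Drift on input features. Concept Drift needs labels or proxy metrics and is not covered by this sketch.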
Let me know your thoughts!
#AI #MachineLearning #DataEngineering
New Active Directory Mindmap v2025.03!
Readable version: https://t.co/gQd6WsLnzG
Now fully generated from markdown files, which makes it way easier to update and maintain!
Got improvements? PRs welcome! https://t.co/o52PAmek7b
@Antonlovesdnb I just purchased the course today. Will this be part of the current course, or a new course that I will have to purchase? I would love this version over the Sumo Logic version, as it fits my training needs better.
#tools #Red_Team_Tactics
1. Embed a payload inside a PNG file
https://t.co/z7ui1I9c1b
2. Early Cascade Injection: From Windows Process Creation to Stealthy Injection
https://t.co/YtCdxuPNBP
3. Concealing payloads in URL credentials
https://t.co/KiP25RQKSs
Made a new tool for a test I was doing. Decided to share it with everyone, and it's going in my toolbox for sure. It's like having X-ray vision into JS files.
It's crazy how many never-before-seen endpoints it pulled out.
https://t.co/xaBJhudBgY
Example: