anjijava16/Databricks_fs
# Databricks_fs

## Different Layers in Databricks Lakehouse Architecture

### Landing Layer (Native Format)

- This layer is optional and depends on the source systems and the data.
- Landing is just a container in the data lake that stores raw source data.
- It is the area where data lands from the data source before it is processed into the Delta layers.
- Data from different external systems is ingested into the data lake in its native format.
- Landing data is simply source system data in native files (CSV, JSON, XML, Parquet, ...).
- Landing data can be structured, semi-structured, or unstructured.
- Landing data arrives from different sources through batch or streaming processes.

### Bronze Layer (Delta Format)

- Source data is converted and loaded in Delta format.
- New data is appended to the Delta tables every day.
- Bronze tables are partitioned by `updated_date`/`load_date` for better performance.
- Data from the different external source systems is managed in the Bronze layer.
- Table structures in this layer correspond to the source system table structures "as-is".
- Bronze tables have additional metadata columns that capture the load date/time, process ID, etc.
- The focus in this layer is quick Change Data Capture and the ability to provide a historical archive of the source (cold storage).
- Bronze can be used for future reload scenarios. All historical data is managed here with audit columns.

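
As a rough illustration of the Bronze rules above (append-only loads, source columns kept as-is, audit columns added, partitioning by load date), here is a minimal sketch in plain Python. It is not Spark or Delta Lake code: the dict of lists is only a stand-in for a partitioned Delta table, and the names `load_to_bronze`, `_load_ts`, `_load_date`, and `_process_id`, as well as the sample CSV, are hypothetical and chosen for illustration.

```python
import csv
import io
import uuid
from collections import defaultdict
from datetime import datetime, timezone


def load_to_bronze(raw_csv, bronze, process_id=None, now=None):
    """Append landing rows to a bronze 'table', keeping source columns as-is.

    `bronze` maps a load date to a list of rows. It is only a stand-in for
    a Delta table partitioned by load date (an assumption for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    process_id = process_id or str(uuid.uuid4())
    load_date = now.date().isoformat()
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Source columns are untouched; audit columns are added alongside.
        row["_load_ts"] = now.isoformat()
        row["_load_date"] = load_date
        row["_process_id"] = process_id
        bronze[load_date].append(row)  # append only, never overwrite
    return bronze


# Hypothetical landing file: raw CSV text exactly as the source delivered it.
landing_file = "order_id,amount\n1,10.5\n2,abc\n1,10.5\n"

bronze_table = defaultdict(list)
load_to_bronze(
    landing_file,
    bronze_table,
    process_id="run-001",
    now=datetime(2024, 1, 15, tzinfo=timezone.utc),
)
```

Note that the duplicate row and the non-numeric amount are loaded unchanged. Bronze keeps the source "as-is", so cleaning is deliberately left to a later layer.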
### Silver Layer (Delta Format)

- Uses Delta Lake tables (with SQL table names)
- Preserves the grain of the original data (no aggregation)
- Eliminates duplicate records
- Production schema enforced
- Data quality checks passed
- Corrupt data quarantined
- Data stored to support production workloads
- Optimized for long-term retention and ad-hoc queries
- Validates data quality and schema
- Enriches and transforms data
- Optimizes data layout and storage for downstream queries
- Provides a single source of truth for analytics

### Gold Layer (Delta Format)

- Contains validated, business-level tables.
- The Gold layer of the lakehouse is typically organized in consumption-ready, "project-specific" databases.
- The Gold layer is for reporting and uses more de-normalized, read-optimized data models with fewer joins.
- The final layer of data transformations and data quality rules is applied here.
- It is the final presentation layer, where each project's data is modeled by business area.
- Kimball-style star schema data models and Inmon-style data marts commonly fit in the Gold layer of the lakehouse.

### Benefits of Multiple Layers

- Simple data model
- Easy to understand and implement
- Enables incremental ETL
- Can recreate your tables from raw data at any time
- ACID transactions, time travel
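
To make the Silver and Gold responsibilities listed above more concrete, the following is a minimal sketch in plain Python, not Spark or Delta Lake code. It assumes a hypothetical order dataset in which `order_id` is the business key and the production schema is an integer `order_id`, a string `customer`, and a float `amount`. The function names `to_silver` and `to_gold` are also made up for illustration.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Split bronze rows into clean silver rows and quarantined rows."""
    silver, quarantine, seen = [], [], set()
    for row in bronze_rows:
        try:
            # Enforce the (hypothetical) production schema.
            clean = {
                "order_id": int(row["order_id"]),
                "customer": row["customer"],
                "amount": float(row["amount"]),
            }
        except (KeyError, ValueError):
            quarantine.append(row)  # corrupt data is set aside, not dropped
            continue
        if clean["order_id"] in seen:
            continue  # eliminate duplicate records
        seen.add(clean["order_id"])
        silver.append(clean)  # grain preserved: no aggregation here
    return silver, quarantine


def to_gold(silver_rows):
    """Aggregate silver rows into a read-optimized, business-level summary."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Hypothetical bronze rows: strings exactly as loaded from the source.
bronze_rows = [
    {"order_id": "1", "customer": "acme", "amount": "10.5"},
    {"order_id": "1", "customer": "acme", "amount": "10.5"},  # duplicate
    {"order_id": "2", "customer": "acme", "amount": "abc"},   # corrupt amount
    {"order_id": "3", "customer": "globex", "amount": "4.0"},
    {"order_id": "4", "customer": "acme", "amount": "2.0"},
]

silver_rows, quarantined = to_silver(bronze_rows)
gold_totals = to_gold(silver_rows)
```

Because both functions derive their output purely from the layer below, rerunning them over the full Bronze history rebuilds Silver and Gold from scratch, which is the "recreate your tables from raw data at any time" benefit.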