
ses-transformer proof of concept (POC)

We'll build the solution on a laptop for development, with the goal of porting it to Amazon S3 (storing the data as Parquet files) and using Amazon Athena as our query engine.

For the development environment today, we'll need:

POC code

The code is available here: GitHub

The files for this project are organized as follows:

  • An input directory for some "input" data that we can test with

  • A sql directory for SQL statements we'll use to query the final data

  • A src directory for the pyspark code that will transform the raw SES events into our records

The program will create:

  • An output_local directory, created when we run the non-streaming (batch) version

  • An output_streaming directory, created by the streaming examples
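As an illustration of what the transformation stage starts from (independent of Spark), each raw SES event notification carries an eventType field that determines which kind of record it becomes. The sketch below uses the SES event publishing field names (eventType, mail.messageId), but the flattened record shape is a simplified assumption, not the project's actual schema:

```python
import json

# The seven event types match the input/ subdirectories.
EVENT_TYPES = {"Bounce", "Click", "Complaint", "Delivery", "Open", "Reject", "Send"}

def route_event(raw: str) -> tuple[str, dict]:
    """Parse one raw SES event and return (event_type, flattened record)."""
    event = json.loads(raw)
    event_type = event["eventType"]
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unexpected eventType: {event_type}")
    # Simplified record; the real transformer keeps many more fields.
    record = {
        "message_id": event["mail"]["messageId"],
        "event_type": event_type.lower(),
    }
    return event_type.lower(), record

sample = json.dumps({
    "eventType": "Delivery",
    "mail": {"messageId": "0000014a-deadbeef"},
})
print(route_event(sample))
# prints ('delivery', {'message_id': '0000014a-deadbeef', 'event_type': 'delivery'})
```

In the pyspark version this dispatch happens per partition rather than per record, but the event-type routing is the same idea.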

// Directory structure

β”œβ”€β”€ input                           --- dev test files
β”‚   β”œβ”€β”€ bounce
β”‚   β”‚   └── bounce.json
β”‚   β”œβ”€β”€ click
β”‚   β”‚   └── click.json
β”‚   β”œβ”€β”€ complaint
β”‚   β”‚   └── complaint.json
β”‚   β”œβ”€β”€ delivery
β”‚   β”‚   └── delivery.json
β”‚   β”œβ”€β”€ open
β”‚   β”‚   └── open.json
β”‚   β”œβ”€β”€ reject
β”‚   β”‚   └── reject.json
β”‚   └── send
β”‚       └── send.json
β”‚
β”œβ”€β”€ sql                            --- sql to create pivot
β”‚   β”œβ”€β”€ 01-mock_request_data.sql
β”‚   β”œβ”€β”€ 02_pivot_base.sql
β”‚   β”œβ”€β”€ 03_ses_pivot.sql
β”‚   └── 04_join_pivot.sql
β”‚
β”œβ”€β”€ src                            --- pyspark code
   β”œβ”€β”€ main.py                             -- main
   β”œβ”€β”€ readers.py                          -- read and process files
   β”œβ”€β”€ transformer.py                      -- perform transformations
   └── writers.py                          -- write to jdbc, batch and stream writers
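The sql scripts build the pivot in stages (base rows, per-type columns, then a join back to the request data). The shape of that pivot can be sketched in plain Python; the message_id and event_type names below are illustrative, not the scripts' actual column names:

```python
from collections import defaultdict

# One wide row per message, with a count column per SES event type --
# the same shape the sql/ pivot scripts produce.
EVENT_TYPES = ["send", "delivery", "open", "click", "bounce", "complaint", "reject"]

def pivot_events(rows):
    """Collapse (message_id, event_type) rows into one wide row per message."""
    wide = defaultdict(lambda: {t: 0 for t in EVENT_TYPES})
    for row in rows:
        wide[row["message_id"]][row["event_type"]] += 1
    return {mid: dict(counts) for mid, counts in wide.items()}

rows = [
    {"message_id": "m-1", "event_type": "send"},
    {"message_id": "m-1", "event_type": "delivery"},
    {"message_id": "m-1", "event_type": "open"},
    {"message_id": "m-2", "event_type": "send"},
    {"message_id": "m-2", "event_type": "bounce"},
]
print(pivot_events(rows)["m-1"]["open"])  # prints 1
```

In Athena the same result comes from conditional aggregation (e.g. SUM(CASE WHEN ... END) per event type) grouped by message, which is what a staged pivot-and-join typically compiles down to.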
