transform-design

Last updated 3 years ago

ses-transformer proof of concept (POC)

We'll build the solution on a laptop for development, with the goal of porting it to Parquet files on Amazon S3 and using Amazon Athena as our database.

For the development environment today, we'll need:

POC code

The code is available on GitHub.

The files for this project are organized as follows:

  • An input directory for some "input" data that we can test with

  • A sql directory for SQL statements we'll use to query the final data

  • A src directory for the pyspark code that will transform the raw SES events into our records
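As a sketch of what the transformer step does, here's a flattening of one raw SES event into a flat record, in plain Python rather than pyspark. The top-level field names (eventType, mail.messageId, mail.destination) follow the SES event JSON format, but the output record layout here is an assumption for illustration, not the project's actual schema:

```python
import json

# A raw SES "Delivery" event, trimmed to the fields we use below.
raw = json.loads("""
{
  "eventType": "Delivery",
  "mail": {"messageId": "m-123", "destination": ["user@example.com"]},
  "delivery": {"timestamp": "2021-01-01T00:00:00.000Z"}
}
""")

# Flatten the nested event into one record per message/event.
record = {
    "message_id": raw["mail"]["messageId"],
    "event_type": raw["eventType"].lower(),
    "recipient": raw["mail"]["destination"][0],
}
# record -> {"message_id": "m-123", "event_type": "delivery",
#            "recipient": "user@example.com"}
```

In the real code this logic runs over a Spark DataFrame of events rather than a single dict.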

The program will create:

  • An output_local directory - when we run the non-streaming version

  • An output_streaming directory for the streaming examples
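A minimal sketch of how a writer might choose between those two directories; the function name is hypothetical, not the project's actual API:

```python
from pathlib import Path

# Hypothetical helper: batch (non-streaming) runs write to output_local,
# streaming runs write to output_streaming.
def output_dir(streaming: bool, base: str = ".") -> Path:
    return Path(base) / ("output_streaming" if streaming else "output_local")

print(output_dir(streaming=False))  # output_local
print(output_dir(streaming=True))   # output_streaming
```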

// Directory structure

├── input                           --- dev test files
│   ├── bounce
│   │   └── bounce.json
│   ├── click
│   │   └── click.json
│   ├── complaint
│   │   └── complaint.json
│   ├── delivery
│   │   └── delivery.json
│   ├── open
│   │   └── open.json
│   ├── reject
│   │   └── reject.json
│   └── send
│       └── send.json
│
├── sql                            --- sql to create pivot
│   ├── 01-mock_request_data.sql
│   ├── 02_pivot_base.sql
│   ├── 03_ses_pivot.sql
│   └── 04_join_pivot.sql
│
└── src                            --- pyspark code
    ├── main.py                             -- main
    ├── readers.py                          -- read and process files
    ├── transformer.py                      -- perform transformations
    └── writers.py                          -- write to jdbc, batch and stream writers
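The four SQL files build up a pivot of the event stream: one row per message, one column per event type. Here's a hedged sketch of that idea using sqlite3; the table and column names are assumptions for illustration, and the real scripts target Spark SQL / Athena:

```python
import sqlite3

# In-memory table of (message_id, event_type) pairs, one row per SES event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ses_events (message_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO ses_events VALUES (?, ?)",
    [("m1", "send"), ("m1", "delivery"), ("m1", "open"),
     ("m2", "send"), ("m2", "bounce")],
)

# Pivot: one row per message, a 0/1 flag column per event type.
rows = conn.execute("""
    SELECT message_id,
           MAX(CASE WHEN event_type = 'send'     THEN 1 ELSE 0 END) AS sent,
           MAX(CASE WHEN event_type = 'delivery' THEN 1 ELSE 0 END) AS delivered,
           MAX(CASE WHEN event_type = 'open'     THEN 1 ELSE 0 END) AS opened,
           MAX(CASE WHEN event_type = 'bounce'   THEN 1 ELSE 0 END) AS bounced
    FROM ses_events
    GROUP BY message_id
    ORDER BY message_id
""").fetchall()
# rows -> [("m1", 1, 1, 1, 0), ("m2", 1, 0, 0, 1)]
```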