From Operational Data to a Data Lake in Minutes!
Data Nessie is a lightweight migration service which moves data from any database into the Amazon Simple Storage Service (Amazon S3). Data Nessie automatically builds and maintains a Data Lake by surfacing all the extracted database tables in Amazon Athena, ready for analysis. Any number of databases can be migrated and synchronised with S3. Data Nessie even keeps the history of all the operational changes.
With Data Nessie there is:
- No complex ETL to write
- No legacy code to maintain
- No training required
- No operational system changes
- No database changes
- No new servers or licences required
- No need for invasive database triggers
The Data Nessie Migration Pipeline
Data Nessie performs Change Data Capture (CDC) replication without needing access to binary logs. Instead of reading logs, the service polls the database tables directly, hunting for row changes.
Data Nessie compares the value of a specified column in each source table (e.g. a timestamp) with the version of the table already stored in the S3 data lake. When rows have been inserted or updated in the operational database, they are extracted and uploaded to S3.
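As an illustration, the poll can be pictured as a simple high-watermark query. The sketch below is ours, not Data Nessie's internals; the table name, CDC column and use of the mysql-connector-python driver are assumptions for the example.

```python
# Illustrative high-watermark poll (our sketch, not Data Nessie internals).
# Assumes a MySQL source table with a CDC column such as `updated_at`.
import mysql.connector

def fetch_changes(conn, table, cdc_column, last_watermark):
    """Return rows changed since the last poll, plus the new watermark."""
    cur = conn.cursor(dictionary=True)
    cur.execute(
        f"SELECT * FROM {table} WHERE {cdc_column} > %s ORDER BY {cdc_column}",
        (last_watermark,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][cdc_column] if rows else last_watermark
    return rows, new_watermark
```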
Initially, Data Nessie copies the full database table into S3; subsequent data lake refreshes copy only the incremental (delta) changes. Each Data Nessie refresh can run on a schedule, and there is no limit on the number of schedules or database sources.
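To make the refresh step concrete, here is a minimal sketch of landing an extracted batch in S3 with boto3. The bucket name and key layout are assumptions for illustration; Data Nessie's actual layout may differ.

```python
# Our sketch of landing a batch of extracted rows in S3 as a timestamped
# CSV object; the bucket name and key layout are illustrative assumptions.
import csv
import io
from datetime import datetime, timezone

import boto3

def land_batch(rows, bucket, table):
    """Write a list of row dicts to s3://<bucket>/<table>/<timestamp>.csv."""
    if not rows:
        return None
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    key = f"{table}/{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.csv"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
    return key
```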
Data Nessie automatically builds the AWS Glue Data Catalog needed to query S3 and uses Amazon Athena to query the data lake tables.
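Once the Glue Data Catalog is in place, the lake is queryable with plain SQL. A minimal sketch of running an Athena query with boto3 follows; the database, table and results-bucket names are hypothetical examples.

```python
# Our sketch of querying a catalogued lake table via Amazon Athena.
# Database, table and results-bucket names are hypothetical examples.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```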
Data Nessie Overview
Here are a few of the business benefits of using Data Nessie for your Data Lake:
- Faster development; no ETL to code, no tool to learn and instant analysis of initial extracts and daily deltas.
- Cheaper development; fewer ETL resources, no tool licence to pay for and no additional database licences required.
- Flexible development; reconfigure the data lake in minutes.
- No technical debt to accumulate.
- Full-service support from our Data Lake and AWS experts.
- One Data Nessie service can populate your Data Lake from all operational systems, providing enterprise analysis at scale.
Data Nessie lives and breathes AWS, so the service fits seamlessly into your existing cloud environment:
- Data Nessie is delivered as an AWS AMI on the AWS Marketplace
- All logs are available in AWS CloudWatch
- Passwords are stored in AWS Secrets Manager (see the sketch after this list)
- Server access is governed by AWS IAM policies
- All database drivers are pulled from Amazon S3
- No datasets or passwords are stored permanently inside the Data Nessie AMI
- Data Nessie automatically generates the AWS Glue Data Catalog
- Data Nessie generates a Data Lake compatible with Amazon Athena
- Data Nessie metadata is exported to Amazon S3 for backup and analysis with Amazon QuickSight
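As an example of the Secrets Manager pattern, database credentials can be fetched at run time rather than stored on disk. The sketch below uses a hypothetical secret name; it illustrates the pattern, not Data Nessie's internal code.

```python
# Illustration of fetching source-database credentials from AWS Secrets
# Manager at run time; the secret name is a hypothetical example.
import json

import boto3

secret = boto3.client("secretsmanager").get_secret_value(
    SecretId="data-nessie/source-db"
)
creds = json.loads(secret["SecretString"])
# creds is expected to hold keys such as "host", "username" and "password"
```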
The following databases are supported as of release 1.0:
- Amazon Aurora
- Amazon Aurora Serverless
- MySQL on Amazon RDS
- MySQL, self-managed or on-premises
We are busy testing Data Nessie against all the PostgreSQL variants and Microsoft SQL Server™; they will be supported in the next release.
Oracle™ will be supported on request.
Data Nessie polls individual tables for changes, so each table copied into the data lake must have a column that indicates when a row has changed. This Change Data Capture (CDC) column can be a timestamp, a sequence or, in the case of Microsoft SQL Server, a rowversion.
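For tables that already carry a timestamp or sequence column, nothing needs to change. Where one is missing, MySQL can maintain a qualifying column automatically on every insert and update, with no triggers involved. The table name and connection details below are hypothetical examples, and this is our sketch rather than Data Nessie tooling.

```python
# Our sketch of adding a self-maintaining CDC timestamp column to a MySQL
# table; the table name and connection details are hypothetical examples.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.com", user="admin", password="...", database="shop"
)
conn.cursor().execute(
    """
    ALTER TABLE orders
        ADD COLUMN updated_at TIMESTAMP
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP
    """
)
```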
Polling CDC solutions can miss intermediate changes when a record is updated more than once between polls; only the latest state is captured. However, because Data Nessie's impact on the database server is light, polls can be frequent.
Currently, the Data Nessie service cannot identify and migrate rows that have been deleted from operational databases. Data Nessie is continually improving, and we are working on a reconciliation process to resolve this limitation in a future release.
Fresh From Our Blog
What’s wrong with the AWS Database Migration Service (AWS DMS)?
Not a lot, in fact! It works perfectly under perfect operational circumstances. The problem is that few operational databases are perfect, either in their schema design or configuration, often for pragmatic reasons like cost and performance. General issues. Primary keys: Ongoing replication requires a primary…