What’s wrong with the AWS Database Migration Service (AWS DMS)?
Not a lot, in fact! It works perfectly under perfect operational circumstances. The problem is that few operational databases are perfect in either their schema design or their configuration, often for pragmatic reasons like cost and performance.
- Primary Keys: Ongoing replication requires a primary key on every table so that DMS can identify the rows affected by updates and deletes. Without a primary key, DMS can still capture inserts, but not reliably track changes to existing rows. Does every table in your application have a primary key?
- Schema Mismatch between IDL/CDC: With an S3 target, DMS adds an extra column to the start of each incremental CSV file to identify rows that were updated (U), inserted (I), or deleted (D). This column does not appear in the DMS Initial Data Loads (IDL), causing a column misalignment in Athena (this bug may now be fixed).
- Glue Data Types: Glue does not factor in the source data type when classifying data for Athena to query. If, for example, numeric-looking data is stored in a string field (e.g. phone numbers with a leading 0), Glue will think these are numbers and classify the column as numeric. This isn’t a bug, but it can create mismatches between source and target if data types don’t align. A possible workaround is to manually change the affected columns back to the desired data type.
- Additional Licences: DMS may require changes to the database configuration and/or edition. These changes could be costly in terms of licensing and administration.
- Newline characters: Athena/Glue treats newline characters within fields as row breaks, creating mismatched columns when the data is displayed in Athena. The current workaround is to replace newline characters with spaces in the operational database before loading with DMS; this is not always practical.
- Data Quality: With DMS, any garbage in the operational database will find its way into the Data Lake, and malformed rows can break Athena queries. Complex Apache Spark jobs will be needed to clean the data in each S3 table.
- GDPR: Personally identifiable information is best removed before it enters the Data Lake; otherwise, complex Apache Spark jobs may be needed to remove it later.
- Parquet S3 Support: possible with AWS DMS, but fiddly (see this AWS blog post for details: https://aws.amazon.com/blogs/database/announcing-the-support-of-parquet-data-format-in-aws-dms-3-1-3/ )
- Support: Do you have the resources to support a complex CDC solution?
- Access Constraints: Direct server access may not be available from a hosted database provider.
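The primary-key caveat above is easy to check up front. The sketch below audits a database for tables lacking a primary key; it uses SQLite purely for illustration (on a real source such as SQL Server or MySQL you would query the information schema instead).

```python
import sqlite3

# Illustrative in-memory database: one table with a primary key, one without.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE audit_log (event TEXT, created_at TEXT);  -- no primary key
""")

def tables_without_primary_key(conn):
    """Return the names of tables that have no primary-key column."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    missing = []
    for table in tables:
        # In PRAGMA table_info, column 5 ('pk') is non-zero for PK columns.
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        if not any(col[5] for col in cols):
            missing.append(table)
    return missing

print(tables_without_primary_key(conn))  # ['audit_log']
```

Any table this reports is one where DMS ongoing replication cannot reliably track updates and deletes.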
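One way to work around the IDL/CDC column misalignment described above is to normalise the full-load files before querying, prefixing each full-load row with an ‘I’ (insert) flag so that both file types share the same column layout. A minimal sketch with hypothetical sample data:

```python
import csv
import io

# Hypothetical samples: a full-load row has no leading operation column,
# while a CDC row carries U/I/D as its first field.
full_load_csv = "101,Alice,London\n102,Bob,Leeds\n"
cdc_csv = "U,101,Alicia,London\nI,103,Carol,York\n"

def normalise(full_load_csv, cdc_csv):
    """Prefix full-load rows with 'I' so both files share one schema."""
    rows = []
    for row in csv.reader(io.StringIO(full_load_csv)):
        rows.append(["I"] + row)  # treat initial-load rows as inserts
    rows.extend(csv.reader(io.StringIO(cdc_csv)))
    return rows

for row in normalise(full_load_csv, cdc_csv):
    print(row)
```

After normalisation every row has the same width, so a single Athena table definition can cover both the initial load and the incremental files.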
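The newline workaround above is usually done in the operational database itself (e.g. an UPDATE with nested REPLACE calls), but the same substitution can be sketched in Python if you have a pre-load step; the row data here is hypothetical:

```python
def sanitise(value: str) -> str:
    """Replace embedded newlines with spaces so CSV rows stay aligned."""
    return value.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")

# Hypothetical record with a multi-line free-text field.
row = {"id": 7, "notes": "line one\nline two\r\nline three"}
clean = {k: sanitise(v) if isinstance(v, str) else v for k, v in row.items()}
print(clean["notes"])  # line one line two line three
```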
Microsoft SQL Server-specific issues
- Service account: A service account may be required; to use the native replication features, the service account may need ‘db_owner’ or ‘sysadmin’ privileges – not ideal for a service account to hold this much power.
- Backup model: For a SQL Server source, the native replication feature reads from the transaction log, which is not backed up under the ‘Simple’ recovery model; only the Full or Bulk-Logged models back up the transaction log. If the transaction log is not backed up, DMS is unrecoverable in the event of failure, and the source and target could become out of sync.
- Complex: Turning on and managing MS SQL Server replication is a core database-administrator task and not for the faint-hearted. See https://www.brentozar.com/archive/2013/09/transactional-replication-change-tracking-data-capture/
- Upgrade Required: The latest replication features ship only in recent versions and may be Enterprise-edition only, so be prepared to dig deep into your pockets!
Data Nessie works by polling the data through a SQL interface via JDBC; there is no need to configure replication or make any operational adjustments. It’s a soft-touch approach to change data capture (CDC).
This is the only approach available in certain cases, e.g. with Amazon Aurora Serverless, where there is no option to access the binary logs for CDC. With Aurora Serverless there is the option of using database triggers, but they bring their own problems. Before hand-cranking database triggers, try out the free version of Data Nessie and see if it meets your requirements!
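To make the polling idea concrete, here is a minimal sketch of polling-based CDC that assumes each table carries a last-modified timestamp column; this is an illustration of the general technique, not Data Nessie’s actual implementation, and it uses SQLite in place of a JDBC connection:

```python
import sqlite3

# Illustrative source table with a last_modified column (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 9.99, "2024-01-01T10:00:00"),
    (2, 24.50, "2024-01-02T11:30:00"),
])

def poll_changes(conn, high_water_mark):
    """Fetch rows modified since the last poll; return them and the new mark."""
    rows = conn.execute(
        "SELECT id, total, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (high_water_mark,)).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

changes, mark = poll_changes(conn, "2024-01-01T23:59:59")
print(changes)  # [(2, 24.5, '2024-01-02T11:30:00')]
```

Each poll advances the high-water mark, so only rows changed since the previous poll are fetched – no transaction-log access, replication configuration, or elevated privileges required.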