Data Source Diff

Overview

Data Source Diff allows you to compare entire data sources or specific tables between different database instances. This is particularly useful for validating data migrations or ensuring consistency across environments.


Key Features

  • Bulk Comparison: Compare multiple tables in one operation
  • Filtering Options: Apply filters to focus on specific subsets of data
  • Summary Diff: Get a high-level summary of differences between source and target
  • Meta Diff: Compare table structures and metadata across data sources
  • Data Diff: Verify row-level data consistency between source and target
  • Export Diff Report: Export results as PDF/CSV/JSON for sharing or further analysis

How to Use SmartDiff Data Source Diff

Step 1: Access SmartDiff

  • Select Smart Diff from the left panel
  • Click Workflow to view workflow history

Step 2: Create New Diff Workflow

  • Click CREATE DIFF
  • By default, DATA SOURCE DIFF is selected
  • Choose Source and Target data sources (can be the same or different)
  • Click Next

Step 3: Select Items

On this page, select the items to compare. The following options are available:

| Feature | Description |
| --- | --- |
| AUTO MAP | Automatically maps items based on similarity. Low-similarity items are skipped. |
| Transform & Filtering | Apply transformations or exclude specific columns (e.g., exclude timestamp columns for better accuracy). |
| SmartDiff Configuration | Configure settings: enable/disable summary diff, choose summary type (count or range), set batch size, enable parallel/sequential execution, and define storage (S3/Azure Blob). |
| Filter Columns | Exclude unnecessary columns or define custom keys for comparison. |
| SCHEDULE (Coming Soon) | Schedule diffs to run at specific times to reduce system load. |
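
To illustrate why excluding timestamp columns can improve accuracy, the sketch below strips a volatile column from two row sets before comparing them. It is only an illustration of the idea, not SmartDiff's implementation; `exclude_columns`, the column names, and the sample rows are all hypothetical.

```python
def exclude_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    excluded = set(excluded)
    return [
        {column: value for column, value in row.items() if column not in excluded}
        for row in rows
    ]


# Hypothetical rows: the business data matches, only the load timestamp differs.
source_rows = [{"id": 1, "cost": 10.0, "updated_at": "2024-01-01T08:00:00"}]
target_rows = [{"id": 1, "cost": 10.0, "updated_at": "2024-01-02T09:30:00"}]

raw_match = source_rows == target_rows
filtered_match = (
    exclude_columns(source_rows, ["updated_at"])
    == exclude_columns(target_rows, ["updated_at"])
)

print(raw_match)       # False
print(filtered_match)  # True
```

Without the exclusion, every migrated row would be reported as different even though the data that matters is identical.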

Once ready:

  • Review items from both source and target (schemas, tables, etc.)
  • Map corresponding items
  • Click Next

Step 4: Configure Key Column

Note: If a primary key exists, this step is skipped.

  • Select a key column to map records (e.g., ID column for cost tables)
  • Click Proceed
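
A key column can only map records reliably if each of its values identifies exactly one row. The sketch below shows one way to check a candidate column before selecting it. It assumes rows are available as Python dictionaries; `is_usable_key` and the sample data are hypothetical and not part of SmartDiff.

```python
def is_usable_key(rows, key_column):
    """Return True if key_column is present, non-null, and unique in every row."""
    values = [row.get(key_column) for row in rows]
    if any(value is None for value in values):
        return False
    return len(set(values)) == len(values)


# Hypothetical rows from a cost table.
rows = [
    {"id": 1, "region": "EU", "cost": 10.0},
    {"id": 2, "region": "EU", "cost": 12.5},
    {"id": 3, "region": "US", "cost": 12.5},
]

print(is_usable_key(rows, "id"))      # True
print(is_usable_key(rows, "region"))  # False
```

A column such as `region` repeats across rows, so it cannot pair a source row with a single target row.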

Step 5: Review Results

Tip: The diff runs asynchronously, so you don't need to wait. Check the results once the process finishes.

  • View live progress with completion percentage
  • Click View Diff to explore detailed reports

Step 6: Analyze Detailed Report

Diff Overview

Shows high-level statistics:

  • Diff Columns: Number of columns with differences
  • Diff Rows: Number of rows with differences
  • Same Rows: Number of identical rows
  • Rows in Source: Total rows in source
  • Rows in Target: Total rows in target
  • Missing Rows in Target: Rows present in source but missing in target
  • New Rows in Target: Rows present in target but not in source
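
The statistics above can be sketched as a small key-based comparison. This is only an illustration of how such counts relate to each other, assuming rows are dictionaries matched on a unique key column; `diff_overview` and the sample rows are hypothetical, and SmartDiff may compute these values differently.

```python
def diff_overview(source_rows, target_rows, key):
    """Compute overview statistics for two row sets matched on a key column."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    shared_keys = source.keys() & target.keys()

    diff_columns = set()
    diff_rows = 0
    for k in shared_keys:
        src, tgt = source[k], target[k]
        # Columns whose values differ for this pair of rows.
        changed = {c for c in src.keys() | tgt.keys() if src.get(c) != tgt.get(c)}
        if changed:
            diff_rows += 1
            diff_columns |= changed

    return {
        "diff_columns": len(diff_columns),
        "diff_rows": diff_rows,
        "same_rows": len(shared_keys) - diff_rows,
        "rows_in_source": len(source_rows),
        "rows_in_target": len(target_rows),
        "missing_rows_in_target": len(source.keys() - target.keys()),
        "new_rows_in_target": len(target.keys() - source.keys()),
    }


source_rows = [
    {"id": 1, "name": "a", "cost": 10},
    {"id": 2, "name": "b", "cost": 20},
    {"id": 3, "name": "c", "cost": 30},
]
target_rows = [
    {"id": 1, "name": "a", "cost": 10},
    {"id": 2, "name": "b", "cost": 25},
    {"id": 4, "name": "d", "cost": 40},
]

overview = diff_overview(source_rows, target_rows, key="id")
print(overview)
```

In this example, row 2 differs in `cost`, row 3 is missing in the target, and row 4 is new in the target.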

Summary View

High-level count differences without full row-by-row comparison:

| Option | Description |
| --- | --- |
| ALL DATA | Shows differences for all columns |
| ONLY DIFF | Shows only columns with differences |
| BY SOURCE | Compare frequency counts from source to target |
| BY TARGET | Compare frequency counts from target to source |
| Graphs | Visual differences based on column type. Numeric: range-based counts; Date: monthly frequency comparisons; String/Other: frequency of unique items |
| Export | Export summary results as CSV/PDF |
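
A frequency-count comparison of a single column can be sketched as follows. The sketch assumes that BY SOURCE starts from the values found in the source and BY TARGET from the values found in the target; that reading, the `frequency_diff` helper, and the sample values are assumptions made for illustration only.

```python
from collections import Counter


def frequency_diff(base_values, other_values):
    """Map each value in base to (base_count, other_count) where the counts differ."""
    base, other = Counter(base_values), Counter(other_values)
    return {
        value: (base[value], other[value])
        for value in base
        if base[value] != other[value]
    }


# Hypothetical values of a "status" column in source and target.
source_status = ["open", "open", "closed", "hold"]
target_status = ["open", "closed", "closed"]

by_source = frequency_diff(source_status, target_status)
by_target = frequency_diff(target_status, source_status)

print(by_source)  # {'open': (2, 1), 'closed': (1, 2), 'hold': (1, 0)}
print(by_target)  # {'open': (1, 2), 'closed': (2, 1)}
```

Because only counts are compared, this kind of summary can flag a problem column without matching individual rows.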

Meta Diff

Compares metadata (schema and column properties):

  • Column Name: Lists all compared columns
  • Property Name: Metadata property (e.g., datatype, length)
  • Source Value / Target Value: Shows values from each system, highlighting differences in red (source) and green (target)
  • Export: Save metadata diff as CSV/PDF
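
The layout of a metadata diff (column name, property name, source value, target value) can be sketched like this. The sketch assumes column metadata is available as nested dictionaries; `meta_diff`, the property names, and the sample metadata are hypothetical and do not reflect SmartDiff's internal format.

```python
def meta_diff(source_meta, target_meta):
    """Return (column, property, source_value, target_value) for each mismatch."""
    mismatches = []
    for column in sorted(source_meta.keys() | target_meta.keys()):
        src_props = source_meta.get(column, {})
        tgt_props = target_meta.get(column, {})
        for prop in sorted(src_props.keys() | tgt_props.keys()):
            src_value, tgt_value = src_props.get(prop), tgt_props.get(prop)
            if src_value != tgt_value:
                mismatches.append((column, prop, src_value, tgt_value))
    return mismatches


# Hypothetical column metadata for the same table in two data sources.
source_meta = {
    "id": {"datatype": "INTEGER", "nullable": False},
    "name": {"datatype": "VARCHAR", "length": 50},
}
target_meta = {
    "id": {"datatype": "BIGINT", "nullable": False},
    "name": {"datatype": "VARCHAR", "length": 100},
}

meta_rows = meta_diff(source_meta, target_meta)
for row in meta_rows:
    print(row)
# ('id', 'datatype', 'INTEGER', 'BIGINT')
# ('name', 'length', 50, 100)
```

A column that exists on only one side appears with `None` for every property on the other side.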

Data Diff

Row-level data comparison with cluster-based grouping:

| Feature | Description |
| --- | --- |
| Clusters | Rows grouped into clusters (sorted by most differences) |
| ONLY DIFF | Show only rows with differences |
| ALL DATA | Show all rows |
| Side by Side View | Compare source vs target in table format (color-coded: red = source diff, green = target diff) |
| Inline View | Highlight inline differences between values |
| Columns to Hide | Hide non-relevant columns |
| Export | Export data-level diffs as CSV/PDF |
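
To give a feel for what an inline view conveys, the sketch below marks character-level differences between a source value and a target value using Python's standard `difflib` module. The marker syntax (`[-removed-]`, `{+added+}`) and the `inline_diff` helper are assumptions for illustration; SmartDiff's actual rendering uses color highlighting and may work differently.

```python
from difflib import SequenceMatcher


def inline_diff(source_value, target_value):
    """Render the target-vs-source difference of two strings with inline markers."""
    parts = []
    matcher = SequenceMatcher(None, source_value, target_value)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(source_value[i1:i2])
            continue
        if i1 != i2:  # text present only in the source
            parts.append("[-" + source_value[i1:i2] + "-]")
        if j1 != j2:  # text present only in the target
            parts.append("{+" + target_value[j1:j2] + "+}")
    return "".join(parts)


print(inline_diff("Jon Smith", "John Smith"))  # Jo{+h+}n Smith
print(inline_diff("same", "same"))             # same
```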

Comparison Modes

  • Full Comparison: Compare all rows and columns
  • Sample Comparison: Compare a representative sample (faster, less resource-intensive)
  • Key-based Comparison: Compare based on primary/custom keys
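
The trade-off between full and sample comparison can be sketched as follows: a sample checks fewer rows, so it is faster but can miss differences that a full comparison would find. The sketch assumes rows are stored in dictionaries keyed by a unique key; `sample_keys`, `sampled_mismatches`, and the seeded random sampling are hypothetical and say nothing about how SmartDiff selects its sample.

```python
import random


def sample_keys(keys, sample_size, seed=0):
    """Pick a reproducible random sample of keys (all keys if the sample is large enough)."""
    ordered = sorted(keys)
    if sample_size >= len(ordered):
        return ordered
    return sorted(random.Random(seed).sample(ordered, sample_size))


def sampled_mismatches(source, target, sample_size, seed=0):
    """Return sampled keys whose source value differs from the target value."""
    return [
        key
        for key in sample_keys(source.keys(), sample_size, seed)
        if source[key] != target.get(key)
    ]


# Hypothetical data: 1000 rows, one of which differs in the target.
source_data = {i: i * 2 for i in range(1000)}
target_data = dict(source_data)
target_data[10] = -1

full = sampled_mismatches(source_data, target_data, sample_size=1000)
partial = sampled_mismatches(source_data, target_data, sample_size=100, seed=1)

print(full)     # [10]
print(partial)  # [10] only if key 10 happened to be sampled, otherwise []
```

A clean sample result therefore suggests, but does not prove, that the full data sets match.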

Best Practices

  • Schedule comparisons during off-peak hours for large datasets
  • Use filtering to reduce load and focus on critical data
  • Save configurations for recurring validations
  • Review summary results before checking detailed diffs

Troubleshooting

  • Connection Timeouts: Check network/database connectivity
  • Permission Issues: Ensure read access to required tables
  • Performance Issues: Use sampling for very large tables