Data Source Diff
Overview
Data Source Diff allows you to compare entire data sources or specific tables between different database instances. This is particularly useful for validating data migrations or ensuring consistency across environments.
Key Features
- Bulk Comparison: Compare multiple tables in one operation
- Filtering Options: Apply filters to focus on specific subsets of data
- Summary Diff: Get a high-level summary of differences between source and target
- Meta Diff: Compare table structures and metadata across data sources
- Data Diff: Verify row-level data consistency between source and target
- Export Diff Report: Export results as PDF/CSV/JSON for sharing or further analysis
How to Use SmartDiff Data Source Diff
Step 1: Access SmartDiff
- Select Smart Diff from the left panel
- Click Workflow to view workflow history
Step 2: Create New Diff Workflow
- Click CREATE DIFF
- By default, DATA SOURCE DIFF is selected
- Choose Source and Target data sources (can be the same or different)
- Click Next
Step 3: Select Items
On this page, select the items to compare. The following options are available:
| Feature | Description |
|---|---|
| AUTO MAP | Automatically maps items based on similarity. Low-similarity items are skipped. |
| Transform & Filtering | Apply transformations or exclude specific columns (e.g., exclude timestamp columns for better accuracy). |
| SmartDiff Configuration | Configure diff settings: enable or disable the summary diff, choose the summary type (count or range), set the batch size, choose parallel or sequential execution, and define the storage location (S3 or Azure Blob). |
| Filter Columns | Exclude unnecessary columns or define custom keys for comparison. |
| SCHEDULE (Coming Soon) | Schedule diffs to run at specific times to reduce system load. |
Once ready:
- Review items from both source and target (schemas, tables, etc.)
- Map corresponding items
- Click Next
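The effect of excluding a column, as described for Transform & Filtering above, can be illustrated outside the product. The following is a minimal Python sketch, not SmartDiff's implementation: the function `drop_columns` and the sample rows are hypothetical, and it only shows why removing a volatile timestamp column lets otherwise identical rows compare as equal.

```python
from typing import Dict, Iterable


def drop_columns(row: Dict[str, object], excluded: Iterable[str]) -> Dict[str, object]:
    """Return a copy of the row without the excluded columns."""
    excluded_set = set(excluded)
    return {col: val for col, val in row.items() if col not in excluded_set}


# Hypothetical rows that differ only in a load timestamp.
source_row = {"id": 7, "cost": 120.5, "updated_at": "2024-01-15T10:00:00"}
target_row = {"id": 7, "cost": 120.5, "updated_at": "2024-02-01T08:30:00"}

# Comparing the raw rows reports a difference caused only by the timestamp.
raw_match = source_row == target_row

# Comparing after exclusion treats the rows as identical.
filtered_match = (
    drop_columns(source_row, ["updated_at"]) == drop_columns(target_row, ["updated_at"])
)

print(raw_match, filtered_match)  # prints: False True
```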
Step 4: Configure Key Column
note
If a primary key exists, this step is skipped.
- Select a key column to map records (e.g., the `ID` column for cost tables)
- Click Proceed
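To show why a key column is needed to map records, here is a small Python sketch. It is an illustration under stated assumptions, not SmartDiff's internal logic: `index_by_key` is a hypothetical helper, and the requirement that key values be unique is an assumption of this sketch rather than a documented product rule.

```python
from typing import Dict, Iterable, List, Tuple


def index_by_key(
    rows: Iterable[Dict[str, object]], key_columns: List[str]
) -> Dict[Tuple[object, ...], Dict[str, object]]:
    """Index rows by the values of the chosen key column(s)."""
    index: Dict[Tuple[object, ...], Dict[str, object]] = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in index:
            # A repeated key makes the source-to-target mapping ambiguous.
            raise ValueError(f"duplicate key {key!r}; choose a unique key column")
        index[key] = row
    return index


# Hypothetical cost table keyed by its ID column.
cost_rows = [
    {"ID": 1, "item": "compute", "cost": 40},
    {"ID": 2, "item": "storage", "cost": 15},
]
cost_index = index_by_key(cost_rows, ["ID"])
print(sorted(cost_index))  # prints: [(1,), (2,)]
```

Once both sides are indexed this way, a source row and a target row with the same key can be compared directly.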
Step 5: Review Results
tip
The diff runs asynchronously, so you don’t need to wait. Check the results once the process finishes.
- View live progress with completion percentage
- Click View Diff to explore detailed reports
Step 6: Analyze Detailed Report
Diff Overview
Shows high-level statistics:
- Diff Columns: Number of columns with differences
- Diff Rows: Number of rows with differences
- Same Rows: Number of identical rows
- Rows in Source: Total rows in source
- Rows in Target: Total rows in target
- Missing Rows in Target: Rows present in source but missing in target
- New Rows in Target: Rows present in target but not in source
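The statistics above can be made concrete with a short Python sketch. It assumes each table has already been loaded into a dictionary keyed by the key column. The function `diff_overview` and the sample data are hypothetical, and the way SmartDiff itself computes these numbers may differ.

```python
from typing import Dict, Hashable, Mapping

Row = Mapping[str, object]


def diff_overview(
    source: Dict[Hashable, Row], target: Dict[Hashable, Row]
) -> Dict[str, int]:
    """Compute overview-style statistics for two tables keyed by a key column."""
    shared_keys = source.keys() & target.keys()
    diff_columns = set()
    diff_rows = 0
    for key in shared_keys:
        src_row, tgt_row = source[key], target[key]
        # Columns whose values differ for this key (a missing column counts as None).
        changed = {
            col
            for col in src_row.keys() | tgt_row.keys()
            if src_row.get(col) != tgt_row.get(col)
        }
        if changed:
            diff_rows += 1
            diff_columns |= changed
    return {
        "diff_columns": len(diff_columns),
        "diff_rows": diff_rows,
        "same_rows": len(shared_keys) - diff_rows,
        "rows_in_source": len(source),
        "rows_in_target": len(target),
        "missing_rows_in_target": len(source.keys() - target.keys()),
        "new_rows_in_target": len(target.keys() - source.keys()),
    }


# Hypothetical data: key 2 changed, key 3 is missing in target, key 4 is new.
source_rows = {
    1: {"name": "alpha", "cost": 10},
    2: {"name": "beta", "cost": 20},
    3: {"name": "gamma", "cost": 30},
}
target_rows = {
    1: {"name": "alpha", "cost": 10},
    2: {"name": "beta", "cost": 25},
    4: {"name": "delta", "cost": 40},
}
overview = diff_overview(source_rows, target_rows)
print(overview)
```

In this example, one row differs (in the `cost` column), one row is identical, one row is missing in the target, and one row is new in the target.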
Summary View
High-level count differences without full row-by-row comparison:
| Option | Description |
|---|---|
| ALL DATA | Shows differences for all columns |
| ONLY DIFF | Shows only columns with differences |
| BY SOURCE | Compare frequency counts from source to target |
| BY TARGET | Compare frequency counts from target to source |
| Graphs | Visual differences based on column type. Numeric: range-based counts. Date: monthly frequency comparisons. String/Other: frequency of unique items. |
| Export | Export summary results as CSV/PDF |
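A count-based summary for a single column can be approximated by comparing value frequencies on each side. The sketch below uses Python's `collections.Counter`. The function `frequency_diff`, its `only_diff` flag, and the sample values are hypothetical and only illustrate the idea of a frequency comparison, not the product's exact behavior.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def frequency_diff(
    source_values: Iterable[Hashable],
    target_values: Iterable[Hashable],
    only_diff: bool = False,
) -> Dict[Hashable, Tuple[int, int]]:
    """Map each distinct value to its (source count, target count)."""
    src_counts = Counter(source_values)
    tgt_counts = Counter(target_values)
    result: Dict[Hashable, Tuple[int, int]] = {}
    for value in src_counts.keys() | tgt_counts.keys():
        # Counter returns 0 for values that are absent on one side.
        pair = (src_counts[value], tgt_counts[value])
        if only_diff and pair[0] == pair[1]:
            continue
        result[value] = pair
    return result


# Hypothetical values of a "status" column on each side.
source_status = ["open", "open", "closed", "closed"]
target_status = ["open", "closed", "closed", "pending"]

all_counts = frequency_diff(source_status, target_status)
changed_counts = frequency_diff(source_status, target_status, only_diff=True)
print(changed_counts)
```

Here `closed` appears twice on both sides, so it is dropped when `only_diff` is set, while `open` and `pending` remain.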
Meta Diff
Compares metadata (schema and column properties):
- Column Name: Lists all compared columns
- Property Name: Metadata property (e.g., datatype, length)
- Source Value / Target Value: Shows values from each system, highlighting differences in red (source) and green (target)
- Export: Save metadata diff as CSV/PDF
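The Column Name, Property Name, Source Value, and Target Value layout described above can be mimicked with a short Python sketch. It assumes column metadata is available as nested dictionaries. `meta_diff` and the sample schemas are hypothetical and do not reflect how SmartDiff reads metadata from a database.

```python
from typing import Dict, List, Tuple

ColumnMeta = Dict[str, Dict[str, object]]


def meta_diff(
    source_meta: ColumnMeta, target_meta: ColumnMeta
) -> List[Tuple[str, str, object, object]]:
    """List (column, property, source value, target value) for each mismatch."""
    rows: List[Tuple[str, str, object, object]] = []
    for column in sorted(source_meta.keys() | target_meta.keys()):
        src_props = source_meta.get(column, {})
        tgt_props = target_meta.get(column, {})
        for prop in sorted(src_props.keys() | tgt_props.keys()):
            src_val, tgt_val = src_props.get(prop), tgt_props.get(prop)
            if src_val != tgt_val:
                rows.append((column, prop, src_val, tgt_val))
    return rows


# Hypothetical schemas: the id type and the name length changed.
source_schema = {
    "id": {"datatype": "INTEGER", "nullable": False},
    "name": {"datatype": "VARCHAR", "length": 50},
}
target_schema = {
    "id": {"datatype": "BIGINT", "nullable": False},
    "name": {"datatype": "VARCHAR", "length": 100},
}

for column, prop, src_val, tgt_val in meta_diff(source_schema, target_schema):
    print(f"{column}.{prop}: source={src_val} target={tgt_val}")
```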
Data Diff
Row-level data comparison with cluster-based grouping:
| Feature | Description |
|---|---|
| Clusters | Rows grouped into clusters (sorted by most differences) |
| ONLY DIFF | Show only rows with differences |
| ALL DATA | Show all rows |
| Side by Side View | Compare source vs target in table format (color-coded: red = source diff, green = target diff) |
| Inline View | Highlight inline differences between values |
| Columns to Hide | Hide non-relevant columns |
| Export | Export data-level diffs as CSV/PDF |
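As a rough illustration of what an inline view conveys, the following Python sketch marks character-level changes between a source value and a target value using the standard library's `difflib.SequenceMatcher`. The `[-...-]` and `{+...+}` markers and the function `inline_diff` are assumptions of this sketch. The product uses color coding, and its matching algorithm is not documented here.

```python
from difflib import SequenceMatcher


def inline_diff(source_value: str, target_value: str) -> str:
    """Render one string with removed source text and added target text marked."""
    parts = []
    matcher = SequenceMatcher(None, source_value, target_value)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(source_value[i1:i2])
            continue
        if i1 != i2:
            # Text present only in the source value.
            parts.append("[-" + source_value[i1:i2] + "-]")
        if j1 != j2:
            # Text present only in the target value.
            parts.append("{+" + target_value[j1:j2] + "+}")
    return "".join(parts)


print(inline_diff("2024-01-15", "2024-01-16"))
```

Unchanged text is emitted once, so only the differing portion of a long value stands out.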
Comparison Modes
- Full Comparison: Compare all rows and columns
- Sample Comparison: Compare a representative sample (faster, less resource-intensive)
- Key-based Comparison: Compare based on primary/custom keys
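One way a sample comparison can stay meaningful is to pick the same keys on both sides, so that sampled source rows have a chance of meeting their target counterparts. The Python sketch below does this by hashing the key. It is a hypothetical approach chosen for illustration (`in_sample` and `sample_rows` are not SmartDiff functions), and the product's own sampling method may differ.

```python
import hashlib
from typing import Dict, Hashable


def in_sample(key: Hashable, sample_percent: int) -> bool:
    """Decide deterministically whether a key belongs to the sample."""
    # Assumes keys have the same string form on both sides.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent


def sample_rows(rows: Dict[Hashable, dict], sample_percent: int) -> Dict[Hashable, dict]:
    """Keep only the rows whose key falls into the sample."""
    return {key: row for key, row in rows.items() if in_sample(key, sample_percent)}


# Hypothetical tables with overlapping keys.
source_table = {i: {"cost": i * 10} for i in range(0, 1000)}
target_table = {i: {"cost": i * 10} for i in range(500, 1500)}

source_sample = sample_rows(source_table, 10)
target_sample = sample_rows(target_table, 10)

# Every sampled key that exists on both sides is sampled on both sides.
shared = source_table.keys() & target_table.keys()
print((source_sample.keys() & shared) == (target_sample.keys() & shared))  # prints: True
```

Because the decision depends only on the key, independent random sampling on each side is avoided. Independent sampling would report many rows as missing simply because they were not drawn on the other side.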
Best Practices
tip
- Run comparisons during off-peak hours for large datasets
- Use filtering to reduce load and focus on critical data
- Save configurations for recurring validations
- Review summary results before checking detailed diffs
Troubleshooting
warning
- Connection Timeouts: Check network/database connectivity
- Permission Issues: Ensure read access to required tables
- Performance Issues: Use sampling for very large tables