2x Processing Speed: Optimizing an Address Data Processing System
The client managed an address parsing system using data from the Address Information System (AIS) and Topologically Integrated Geographic Encoding and Referencing (TIGER). Their setup relied on MS SQL Server and SSIS, combined with stored procedures, C# assemblies, and regular expressions for parsing, verifying, and cleansing data.
Challenges Our Client Had
The client faced several bottlenecks and inefficiencies that made scaling and maintaining the system increasingly difficult:
- Performance bottlenecks. Processing over a million records caused slowdowns, as MS SQL Server and SSIS couldn’t handle the workload.
- Inconsistent data flow. Response times were unpredictable, disrupting workflows and data delivery timelines.
- Duplicate data. Monthly data updates contained 80–90% duplicates, which wasted storage and processing power.
- Strict matching algorithms. The system failed to handle minor errors in input, leaving valid addresses undetected.
- Address format variations. The system treated all addresses the same, causing issues with non-standard formats, like those in Puerto Rico.
- Technology limitations. The existing stack couldn’t keep up with growing data demands. This increased system complexity and maintenance efforts.
Solutions Implemented
We tackled these issues with a combination of performance analysis, architecture upgrades, and algorithm improvements.
Performance Analysis
Using JetBrains dotTrace, we pinpointed bottlenecks in MS SQL Server queries and parsing algorithms and optimized them for faster execution.
Architecture Overhaul
We migrated data storage to Redis, which improved read/write speeds and allowed the system to handle larger datasets.
Advanced Parsing Algorithms
Six new parsing algorithms were developed, including one specifically for Puerto Rican addresses, to handle input variability and increase accuracy.
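To illustrate why Puerto Rican addresses need their own handling, here is a minimal sketch of a format-aware parser. It is not the client's algorithm (their parsers were written in C#); it assumes a comma-separated input and relies on the fact that Puerto Rican addresses often carry an urbanization name, commonly abbreviated URB, ahead of the street line. The function name `parse_address` and both regular expressions are hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns for this sketch, not taken from the client's system.
STREET_RE = re.compile(r"^(?P<number>\d+[A-Za-z]?)\s+(?P<street>.+)$")
STATE_ZIP_RE = re.compile(r"^(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?$")


def parse_address(raw: str) -> Optional[dict]:
    """Parse '[URB name,] number street, city, ST 12345' into its fields.

    Returns None when the input does not fit the assumed layout.
    """
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) < 3:
        return None

    # An optional leading "URB <name>" segment marks an urbanization.
    urbanization = None
    if parts[0].upper().startswith("URB "):
        urbanization = parts[0][4:].strip()
        parts = parts[1:]

    if len(parts) != 3:
        return None

    street_match = STREET_RE.match(parts[0])
    tail_match = STATE_ZIP_RE.match(parts[2])
    if not street_match or not tail_match:
        return None

    return {
        "urbanization": urbanization,
        "number": street_match.group("number"),
        "street": street_match.group("street"),
        "city": parts[1],
        "state": tail_match.group("state").upper(),
        "zip": tail_match.group("zip"),
    }
```

A parser that treats every address identically would either reject the urbanization segment or fold it into the street name; separating it out keeps the remaining fields comparable with mainland addresses.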
Caching and Deduplication
An internal caching system reduced redundant operations. We also used Hadoop for deduplication and Apache Solr for indexing, making data retrieval faster.
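The idea behind both steps can be sketched in a few lines. This is an in-memory illustration only, assuming records arrive as plain strings: the project itself ran deduplication on Hadoop, and the details of its caching layer are not described here. The helper names (`normalize`, `fingerprint`, `new_records`, `parse_cached`) are hypothetical.

```python
import hashlib
from functools import lru_cache
from typing import Iterable, Iterator, Set


def normalize(record: str) -> str:
    """Uppercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(record.upper().split())


def fingerprint(record: str) -> str:
    """Stable content hash of the normalized record."""
    return hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()


def new_records(batch: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield only records not seen before; `seen` is updated in place."""
    for record in batch:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            yield record


@lru_cache(maxsize=100_000)
def parse_cached(normalized: str) -> tuple:
    """Stand-in for an expensive parse; repeated inputs are served from the cache."""
    return tuple(normalized.split(" "))
```

Keeping the `seen` set across monthly loads means a batch that is mostly duplicates only pays the parsing cost for its genuinely new records.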
Continuous Monitoring
Regular profiling with JetBrains dotTrace kept performance in check and let us make adjustments as issues surfaced.
Technologies Used
- Redis
- MS SQL Server
- C#
- Hadoop
- Apache Solr
- JetBrains dotTrace
Results
- Processing time cut by 50% (2x throughput).
- 40% traffic reduction due to structured directories.
- The system now processes over a million records without slowdowns, with room for growth.
- Enhanced efficiency through caching mechanisms.
- Streamlined new data loading by eliminating duplicates.
- New algorithms detect addresses even with minor input errors.
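Tolerance to minor input errors, the last result above, can be illustrated with approximate string matching. The sketch below uses Python's standard `difflib` module as a stand-in; it is not the matching technique used in the project, and the reference list `KNOWN_STREETS`, the function `match_street`, and the 0.85 cutoff are assumptions made for the example.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical reference data; a real system would look names up in its address index.
KNOWN_STREETS = ["MAIN STREET", "OAK AVENUE", "PINE ROAD"]


def match_street(
    raw: str,
    known: Sequence[str] = KNOWN_STREETS,
    cutoff: float = 0.85,
) -> Optional[str]:
    """Return the known street name closest to `raw`, or None if nothing is close enough."""
    candidate = " ".join(raw.upper().split())

    # Exact hits skip the more expensive similarity search.
    if candidate in known:
        return candidate

    matches = get_close_matches(candidate, list(known), n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this sketch, `match_street("Main Stret")` resolves to `"MAIN STREET"`, while an unrelated name such as `"Elm Boulevard"` returns `None`. A strict equality check would have rejected both.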
Discover more here: https://www.intsurfing.com/big-data-projects/address-processing-system/