November 19, 2024 • 2 min read
Developing Nannostomus for Efficient Web Data Extraction
Intsurfing identified a growing need among businesses to extract structured data from websites for analytics and decision-making. Recognizing that many organizations lacked the technical expertise or resources to build custom web scrapers, Intsurfing embarked on creating a versatile, user-friendly solution.
Challenges
- Uneven Load. Previous scraping setups failed to maximize AWS resource utilization. Virtual machines often operated inefficiently, with some EC2 instances underutilized while others were overloaded. This imbalance increased operational costs, particularly when scaling up.
- Request Limits and IP Bans. Scraping activity often triggered request limits or IP bans, disrupting data collection. Mitigating these risks required integrating proxies, rotating IP addresses, and introducing delays between requests to mimic human browsing behavior (a proxy-rotation sketch follows this list).
- High Database Load. When too many scrapers saved data simultaneously, parallel updates to the same records caused conflicts, degrading database performance and risking data integrity. To address this, we implemented optimistic locking, using per-record version checks to detect conflicting writes and allow safe parallel updates (an optimistic-locking sketch follows this list).
- Scalability. Designing a system capable of handling large-scale data extraction tasks without compromising performance.
- Cost Efficiency. Ensuring the solution remained affordable, even when processing vast amounts of data.
- User Accessibility. Developing a tool that users with minimal programming skills could operate effectively.
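To make the proxy-rotation approach concrete, here is a minimal C# sketch. It illustrates the general pattern rather than Nannostomus code: the proxy addresses, target URLs, and delay window are all placeholder assumptions.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch of proxy rotation with randomized delays.
// The proxy list, URLs, and delay window are placeholders, not Nannostomus config.
class RotatingScraper
{
    static readonly string[] Proxies =
    {
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    };

    static readonly Random Rng = new Random();

    static async Task Main()
    {
        var urls = new[] { "https://example.com/page/1", "https://example.com/page/2" };

        for (int i = 0; i < urls.Length; i++)
        {
            // Rotate through proxies round-robin so consecutive requests
            // leave from different IP addresses.
            var handler = new HttpClientHandler
            {
                Proxy = new WebProxy(Proxies[i % Proxies.Length]),
                UseProxy = true,
            };

            using var client = new HttpClient(handler);
            var html = await client.GetStringAsync(urls[i]);
            Console.WriteLine($"Fetched {urls[i]}: {html.Length} bytes");

            // Random pause between requests to mimic human browsing
            // and stay under per-IP request limits.
            await Task.Delay(TimeSpan.FromSeconds(Rng.Next(2, 8)));
        }
    }
}
```

Round-robin rotation spreads requests across IPs, while the randomized delay keeps request timing irregular enough to resemble human browsing.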
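The optimistic-locking fix can likewise be sketched in a few lines. This in-memory version is a simplified assumption about the mechanism, not the production code; in a real database the same idea is expressed as an UPDATE guarded by a version predicate.

```csharp
using System;

// Minimal in-memory sketch of optimistic locking: each record carries a
// version number, and a write succeeds only if the version the writer read
// is still current. In a real database this becomes the SQL pattern
//   UPDATE records SET data = @data, version = version + 1
//   WHERE id = @id AND version = @expectedVersion;
// The Record type and conflict handling here are illustrative assumptions.
class Record
{
    public string Data = "initial";
    public int Version = 0;
    private readonly object _gate = new object();

    // Returns false when another writer changed the record first,
    // so the caller must re-read and retry.
    public bool TryUpdate(int expectedVersion, string newData)
    {
        lock (_gate)
        {
            if (Version != expectedVersion) return false; // conflict detected
            Data = newData;
            Version++;
            return true;
        }
    }
}

class Demo
{
    static void Main()
    {
        var record = new Record();

        // Two scrapers read version 0, then both try to write.
        bool first = record.TryUpdate(expectedVersion: 0, newData: "scraper A result");
        bool second = record.TryUpdate(expectedVersion: 0, newData: "scraper B result");

        Console.WriteLine($"first write: {first}, second write: {second}");
        // Output: first write: True, second write: False
        // The losing writer re-reads the record and retries against Version = 1.
    }
}
```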
Solutions Implemented
- Microservices Architecture. Implemented a microservices framework to allow independent updates and maintenance of system components without affecting overall operations.
- Mediator Component. Developed a serverless Mediator application to interface between the user console and backend services, streamlining task management and execution.
- Work Balancer (WBalancer). Created WBalancer to distribute tasks evenly across virtual machines (NWorkers), optimizing resource utilization and preventing overloads (a distribution sketch follows this list).
- Worker Registry (WRegistry). Established WRegistry to manage the lifecycle of NWorkers, automating startup, shutdown, and health checks to ensure system readiness.
- Code Repository (CRepository). Set up CRepository as a serverless application for building and deploying specialized software packages (CRunner images) tailored to specific data extraction tasks.
- NWorker and CRunner Components. Developed NWorker to manage CRunner instances within each virtual machine, running the task-specific extraction code packaged by CRepository.
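The load-distribution idea behind WBalancer can be sketched as a least-loaded assignment loop. The worker names, in-process model, and load metric below are simplified assumptions; the real WBalancer distributes work across virtual machines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of WBalancer-style task distribution: each incoming
// task goes to the NWorker currently carrying the least load, keeping the
// fleet evenly utilized. Worker names and the load metric are assumptions.
class Worker
{
    public string Name { get; }
    public int ActiveTasks { get; private set; }

    public Worker(string name) => Name = name;

    public void Assign(string task)
    {
        ActiveTasks++;
        Console.WriteLine($"{task} -> {Name} (now {ActiveTasks} tasks)");
    }
}

class Balancer
{
    static void Main()
    {
        var workers = new List<Worker>
        {
            new Worker("nworker-1"),
            new Worker("nworker-2"),
            new Worker("nworker-3"),
        };

        // Ten tasks assigned least-loaded-first end up spread 4/3/3
        // instead of piling onto a single instance.
        for (int i = 1; i <= 10; i++)
        {
            var target = workers.OrderBy(w => w.ActiveTasks).First();
            target.Assign($"task-{i}");
        }
    }
}
```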
Technologies Used
- AWS, GCP, Azure. Integrated with major cloud platforms to provide flexibility and scalability.
- Docker. Utilized for containerization of CRunner images, ensuring consistent deployment across environments.
- C#. Employed for developing NWorker and CRunner applications.
Results
- High Scalability. Nannostomus handles large-scale data extraction tasks, distributing workloads to prevent bottlenecks.
- Cost-Effective Operations. The system keeps operational costs low, with per-record extraction costs as low as $0.0001 in certain projects.
- User-Friendly Interface. Designed for users with basic programming skills, enabling broader accessibility for data extraction needs.
- Reliable Performance. The microservices architecture and robust component management ensure consistent and reliable system performance.
Read more here: https://www.intsurfing.com/big-data-projects/nannostomus/