Introduction
The main objective of this project was to construct organization profiles from information available on the internet. To do so, it combines several web scraping and information extraction components. The lead generation pipeline consists of the following components:
- Google search result scraper
- Crunchbase web profile filter and extractor
- Open Corporates web profile extractor
- D&B web profile extractor
- Avention web profile extractor
- Google address extractor
- Google telephone number extractor
- Google CEO/MD extractor
- Website contact page extractor
- LinkedIn information extractor through Google
- Owler Q&A extractor through Google
- Deep crawler on website
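As a rough illustration of how components like these could feed a single organization profile, here is a minimal sketch. It is not the project's actual code: the function names (extract_phone, extract_email, run_pipeline), the regular expressions, and the rule that earlier stages take precedence are all assumptions made for the example, and it works on plain page text rather than live scraped pages.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stage signature: receives the profile built so far and the raw
# page text, and returns any new fields it managed to extract.
Stage = Callable[[Dict[str, str], str], Dict[str, str]]

# Deliberately loose patterns, for illustration only.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_phone(profile: Dict[str, str], text: str) -> Dict[str, str]:
    """Return the first phone-like string found in the text, if any."""
    match = PHONE_RE.search(text)
    return {"telephone": match.group(0)} if match else {}


def extract_email(profile: Dict[str, str], text: str) -> Dict[str, str]:
    """Return the first email-like string found in the text, if any."""
    match = EMAIL_RE.search(text)
    return {"email": match.group(0)} if match else {}


def run_pipeline(name: str, text: str, stages: List[Stage]) -> Dict[str, str]:
    """Run each stage in order and merge its fields into one profile.

    Assumption: a field set by an earlier stage is never overwritten
    by a later one.
    """
    profile: Dict[str, str] = {"name": name}
    for stage in stages:
        for key, value in stage(profile, text).items():
            profile.setdefault(key, value)
    return profile


if __name__ == "__main__":
    page = "Contact Acme Ltd at +94 11 234 5678 or info@acme.example."
    print(run_pipeline("Acme Ltd", page, [extract_phone, extract_email]))
```

Keeping every extractor behind the same small signature means a new source can be added to the list of stages without touching the merge logic.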
Framework
The following figure depicts the lead generation pipeline.
Technologies and areas
Azure services (Virtual Machines, Storage queues), MongoDB, Python, Flask, web crawling, HTML, NLP (keyword/topic extraction, text clustering and classification)
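To give a sense of what keyword extraction over crawled page text can look like, the sketch below uses plain term frequency with a small stop-word list. This is a stand-in technique chosen for brevity, not necessarily the method used in the project; the function name top_keywords and the stop-word list are assumptions for the example.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for",
    "is", "are", "we", "our", "with", "on",
}


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Return the k most frequent non-stop-words of three or more letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]


if __name__ == "__main__":
    sample = "We build cloud software. Our cloud software runs analytics in the cloud."
    print(top_keywords(sample, k=2))
```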
Team
Nishan Mills, Gihan Gamage (me)