This guide is about using Django, the most popular Python web framework, together with Scrapy, the most popular Python scraping framework. Both frameworks are awesome, and each works very well on its own.
Before you continue reading, make sure you are already beyond the “Getting Started” stage for both frameworks.
By the end of the guide, you will be able to:
Run Scrapy and automatically save the crawled items through the Django ORM
1) Scrapy’s settings.py
    def setup_django_env(path):
        import imp, os
        # setup_environ is deprecated as of Django 1.4 but still available there
        from django.core.management import setup_environ

        # Locate and load the Django settings module from the given directory
        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)

        setup_environ(project)

        # Add django project to sys.path
        import sys
        sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

    setup_django_env('/path/to/django/myproject/myproject/')
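The last line of `setup_django_env` is the part that matters for the Django 1.4 layout: the settings live in the inner `myproject/myproject/` package, while apps such as `myapp` sit one level up, so it is the parent directory that has to be importable. Here is a minimal sketch of just that path calculation. The helper name `django_project_root` is made up for illustration, and the path is the same placeholder used above.

```python
import os


def django_project_root(settings_dir):
    """Return the directory that contains the settings package."""
    # os.path.pardir is ".."; abspath collapses it and drops the trailing slash.
    return os.path.abspath(os.path.join(settings_dir, os.path.pardir))


# Placeholder path mirroring the one in this guide, not a real location.
root = django_project_root('/path/to/django/myproject/myproject/')
print(root)
```

The printed directory is the outer `myproject` folder, which is what ends up on `sys.path` so that `from myapp.models import Poll` can be resolved from inside Scrapy.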
2) Scrapy’s items.py
    from scrapy.contrib_exp.djangoitem import DjangoItem
    from myapp.models import Poll

    class PollItem(DjangoItem):
        django_model = Poll
3) Scrapy’s pipelines.py
    from myapp.models import Poll

    class PollPipeline(object):
        def process_item(self, item, spider):
            item.save()
            return item
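To see what the pipeline does without a database, here is a small sketch that exercises the same `process_item` logic against a stand-in item. `FakeItem` is hypothetical and only mimics the `save()` method that a `DjangoItem` provides; it is not part of Scrapy or Django.

```python
class FakeItem(object):
    """Hypothetical stand-in for PollItem; only mimics save()."""

    def __init__(self):
        self.save_calls = 0

    def save(self):
        # A real DjangoItem would write a model instance to the database here.
        self.save_calls += 1


class PollPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item


item = FakeItem()
# The spider argument is unused by this pipeline, so None is enough here.
result = PollPipeline().process_item(item, spider=None)
print(result is item, item.save_calls)
```

The pipeline saves each item once and returns it unchanged, so any later pipelines still receive it. Keep in mind that Scrapy only runs pipelines listed in the `ITEM_PIPELINES` setting, so `PollPipeline` needs an entry there too; the exact form of that setting depends on your Scrapy version.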
Done!
That’s all it takes to run Scrapy and automatically save the items through the Django ORM. You can now run your regular crawl command:

    scrapy crawl myspider
PS: This guide aims to be complete. It builds on a popular Stack Overflow answer and completes the picture for Django 1.4, which adopted a new project layout. It also provides the code for the experimental DjangoItem (rare!).