Setting up Django with Scrapy

This guide is about using Django, the most popular Python web framework, and Scrapy, the most popular Python scraping framework. Both of the frameworks are awesome, and they work very well standalone.

Before you continue reading, make sure you are already beyond “Getting Started” stage for both the frameworks.

At the end of the guide, what you can achieve is:

Run scrapy, and auto save the crawled items in Django ORM

1) Scrapy’s settings.py

def setup_django_env(path):
  import imp, os
  from django.core.management import setup_environ
 
  f, filename, desc = imp.find_module('settings', [path])
 project = imp.load_module('settings', f, filename, desc)       
 
 setup_environ(project)
 
  # Add django project to sys.path
  import sys
  sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))
 
setup_django_env('/path/to/django/myproject/myproject/')

2) Scrapy’s items.py

from scrapy.contrib_exp.djangoitem import DjangoItem
from myapp.models import Poll
 
class PollItem(DjangoItem):
    django_model = Poll

3) Scrapy’s pipelines.py

from myapp.models import Poll
 
class PollPipeline(object):
 
    def process_item(self, item, spider):
 
      item.save()
        return item

Done!

That’s all to run scrapy and auto save the items to Django ORM. You can now run your regular

scrapy crawl myspider

PS: This guide serves to be complete. It adds to a popular Stackoverflow answer, and completes the picture for Django 1.4, which Django adopts a new layout. And also provide the code for the experimental DjangoItem (rare!).

TAGGED WITH

django, python