Setting up Django with Scrapy

This guide is about using Django, the most popular Python web framework, together with Scrapy, the most popular Python scraping framework. Both frameworks are awesome, and each works very well on its own.

Before you continue reading, make sure you are already past the “Getting Started” stage for both frameworks.

By the end of the guide, you will be able to:

Run Scrapy and automatically save the crawled items through the Django ORM.

1) Scrapy’s settings.py

def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ

    # Locate and load the Django project's settings module from `path`
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure Django to use those settings
    setup_environ(project)

    # Add the Django project to sys.path so its apps can be imported
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
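The snippet above targets Django 1.4. setup_environ was removed in later Django releases, and Django 1.7 added django.setup() as the supported way to initialise Django from a standalone script. The following is only a sketch of how the same step might look on a newer Django. The helper names configure_django_env and setup_django are hypothetical, and 'myproject.settings' is an assumed settings module name matching the layout used in this guide.

```python
import os
import sys


def configure_django_env(project_root, settings_module):
    """Make the Django project importable and point Django at its settings.

    Both arguments are assumptions about your layout: `project_root` is the
    directory that contains the `myproject` package, and `settings_module`
    is the dotted path to its settings.
    """
    root = os.path.abspath(project_root)
    if root not in sys.path:
        sys.path.insert(0, root)
    # setdefault keeps any value already exported in the shell
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return root


def setup_django(project_root, settings_module):
    """Configure paths, then initialise Django (needs Django 1.7 or newer)."""
    configure_django_env(project_root, settings_module)
    import django  # imported late, after sys.path has been adjusted
    django.setup()


# In Scrapy's settings.py you would then call, for example:
# setup_django('/path/to/django/myproject', 'myproject.settings')
```

Note that the path here is the outer project directory (the one holding manage.py in the Django 1.4 layout), not the inner package directory passed to setup_django_env above.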

2) Scrapy’s items.py

from scrapy.contrib_exp.djangoitem import DjangoItem
from myapp.models import Poll
 
class PollItem(DjangoItem):
    django_model = Poll

3) Scrapy’s pipelines.py

from myapp.models import Poll
 
class PollPipeline(object):
 
    def process_item(self, item, spider):
        # DjangoItem.save() writes the item through the Django ORM
        item.save()
        return item
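One step the listing does not show: Scrapy only runs pipelines that are listed in the ITEM_PIPELINES setting, so the pipeline has to be enabled in Scrapy’s settings.py. A minimal sketch follows, assuming the Scrapy project package is called myscrapy (a hypothetical name, use your own). Newer Scrapy versions expect a dict mapping each pipeline to an order number; very old releases used a plain list of paths instead.

```python
# Scrapy's settings.py
# 'myscrapy' is a hypothetical Scrapy project name; replace it with yours.
# The number sets the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    "myscrapy.pipelines.PollPipeline": 300,
}
```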

Done!

That’s all it takes to run Scrapy and automatically save the items through the Django ORM. You can now run your usual crawl command:

scrapy crawl myspider

PS: This guide aims to be complete. It builds on a popular Stack Overflow answer and completes the picture for Django 1.4, which adopted a new project layout. It also provides the code for the experimental DjangoItem (rare!).

    • Emanuel

      Thank you, this guide was really great. Somehow I could not figure out how to actually import the DjangoItem package so this is just what I needed, actual code to see what I was doing wrong.

    • Goran K

      ImportError: No module named models
      For some reason I can’t import django app to items.py even if I add path to the app in sys.path

    • Zubair

      from django.core.management import setup_environ will not work in current django version…then how to do this.