Scalable Scraping Using Machine Learning


Eddie Bell & Jonathan Heusser

@ejlbell & @jonathanheusser

Fashion e-commerce, on site checkout,
£20m sales, 1000% growth,
12 million products, 300+ retailers.

Introducing:

The Data Team

Provide one consistent, up-to-date product catalog.

With correct prices, stock, images, etc.


Only one way to achieve this:

A robust scraping pipeline

How To

Robust scraping

Spiders that do not break easily

Quick detection and fixing of broken spiders

Ensure a consistent high quality of data

Spiders

Every retailer has at least one spider

All based on Scrapy

Large library of custom extensions, middleware and mixins

CSS selectors for extraction

Your Friendly Neighbourhood

Spidermen

The remote maintenance team

Example Spider Output


{
 'code': 161143633,
 'color': 'black',
 'description': 'Jil Sander Runway Black Leather Boot with Chunky Heel',
 'designer': 'Jil Sander',
 'gbp_price': '1150.00',
 'gender': 'F',
 'image_urls': ['...'],
 'link': 'http://www.edonmanor.com/collections/jil-sander/products/jil-sander-runway-boot',
 'name': 'Jil Sander Runway Boot',
 'raw_color': 'black',
 'sale_discount': 0,
 'stock_status': {'36.5': 3, '38': 3, '39': 3, '40': 3},
 'subtype': 'B',
 'type': 'S'
}
Structure

Spider Anatomy

Spiders are based on the template pattern.


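class Spider:
    # Each item_* method extracts one field from the response.
    def item_code(self, response):
        ...
    def item_gender(self, response):
        ...
    def item_type(self, response):
        ...
    def item_color(self, response):
        ...
    def item_images(self, response):
        ...

    def parse_item(self, response):
        # Evaluate all item_* methods and yield a dictionary of results.
        item = {}
        for name in dir(self):
            if name.startswith('item_'):
                item[name] = getattr(self, name)(response)
        yield item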

Core principle of spider development: mixins,
e.g. for prices, colors, data sources & classifiers.
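
A minimal sketch of how mixins might share behaviour between spiders. Every name here (PriceMixin, ColorMixin, ExampleSpider, the alias table, the dict standing in for a response) is hypothetical and only illustrates the idea; it is not the actual extension library.

```python
import re


class PriceMixin:
    """Hypothetical mixin: parse a price string into a 2-decimal string."""
    price_pattern = re.compile(r"(\d[\d,]*(?:\.\d+)?)")

    def parse_price(self, text):
        match = self.price_pattern.search(text)
        if match is None:
            return None
        # Drop thousands separators before converting.
        return "{:.2f}".format(float(match.group(1).replace(",", "")))


class ColorMixin:
    """Hypothetical mixin: map a raw colour name onto a canonical one."""
    color_aliases = {"noir": "black", "jet": "black", "navy": "blue"}

    def normalise_color(self, raw):
        raw = raw.strip().lower()
        return self.color_aliases.get(raw, raw)


class ExampleSpider(PriceMixin, ColorMixin):
    """A retailer spider picks up shared behaviour by listing mixins."""

    def item_gbp_price(self, response):
        return self.parse_price(response["price_text"])

    def item_color(self, response):
        return self.normalise_color(response["color_text"])


# A plain dict stands in for a real response object (assumption).
spider = ExampleSpider()
fake_response = {"price_text": "GBP 1,150.00", "color_text": " Noir "}
price = spider.item_gbp_price(fake_response)
color = spider.item_color(fake_response)
```

The point of the design: a fix to price parsing lands in one mixin and reaches every spider that uses it.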

Feeds

A site redesign is one cause of spider failure

Feeds give us structured data

Architecture for processing and normalisation

start_url HELL

Type Classifier

Chained set of SVMs trained on sparse binary text features

99% Accuracy

#magic
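
A rough sketch of the chaining idea only, not the real classifier. Sparse binary features are modelled as the set of tokens present in the text, and each stage sees the predictions of the stages before it. The KeywordStage class and its rules are invented stand-ins for trained SVMs, used so the example runs without any ML library.

```python
def binary_features(text, extra=()):
    """Sparse binary features: the set of lower-cased tokens present."""
    return set(text.lower().split()) | set(extra)


class KeywordStage:
    """Stand-in for one trained SVM: the first matching rule wins."""

    def __init__(self, rules, default):
        self.rules = rules      # list of (required feature set, label)
        self.default = default

    def predict(self, features):
        for required, label in self.rules:
            if required <= features:
                return label
        return self.default


def predict_chain(text, stages):
    """Run stages in order, feeding earlier predictions to later ones."""
    result, extra = {}, []
    for field, stage in stages:
        label = stage.predict(binary_features(text, extra))
        result[field] = label
        # The prediction becomes a feature for every later stage.
        extra.append("{}={}".format(field, label))
    return result


# Hypothetical rules, chosen only to make the example work.
stages = [
    ("gender", KeywordStage([({"pump"}, "female")], "unknown")),
    ("type", KeywordStage([({"pump"}, "shoes")], "unknown")),
    ("subtype", KeywordStage([({"type=shoes", "heel"}, "heels")], "unknown")),
]
prediction = predict_chain(
    "Chloe Green Suede Gold Trim Pump with Chunky Heel", stages)
```

Here the subtype stage only fires because the type stage already said shoes, which is the reason for chaining.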

Type Classifier Example


type_classifier.predict('Chloe Green Suede Gold Trim Pump with Chunky Heel')

Out[1]$ {
    'gender': 'female',
    'type': 'shoes',
    'subtype': 'heels'
}
#moremagic

Type Classifier Example


type_classifier.predict('Everyday Shopper')

Out[1]$ {
    'gender': 'female',
    'type': 'bags',
    'subtype': 'totes'
}

Auto Classifier

Moderators and the problem with humans

Autoclassifier (linked with feed architecture)

SGD on 5m x 5m sparse matrix


autoclassifier.predict('With a focus on a well-tailored silhouette, Badgley Mischka crafts \
                        an understated and statuesque gown flaunting a bow-accented v back. \
                        Polyester/viscose. Dry clean. Imported. Boat neck, sleeveless, \
                        softly pleated shoulders. V back, bow detail, concealed side zip')
Out[1]$ {
 'confident': True,
 'gender': {'confident': True, 'prediction': 'female'},
 'type':   {'confident': True, 'prediction': 'clothing'},
 'cat':    {'confident': True, 'prediction': 'dresses'},
 'subcat': {'confident': True, 'prediction': 'gowns'},
 'color':  {'confident': False, 'prediction': ''}
}
The Sex Panther Method

Auto Classifier

60% of the time, it works every time.


The toilet brush incident

Auto Classifier Dangers

Elementary, My dear

Diagnostics and Databot

1, 2, 3 ..

Measure everything

Spiders send scrape statistics to graphite

Broad categories: exceptions, items scraped, scrape duration, etc.

Query graphite and store it in Pandas

Run statistics and basic timeseries analysis
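
A small sketch of the kind of check this could involve, assuming a per-spider history of "items scraped" counts. The talk uses Graphite and Pandas; this version uses only the standard library and hard-coded numbers so it runs anywhere. The function name, the threshold and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up item counts from previous scrapes of one spider.
items_scraped = [1020, 998, 1011, 1005, 990, 1003]

broken = is_anomalous(items_scraped, 12)     # sudden collapse in items
healthy = is_anomalous(items_scraped, 1000)  # an ordinary run
```

Lowering the threshold flags more scrapes; raising it flags fewer.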

Error categories

Deciding what is broken is hard. Focus on detecting that something is broken, not what.

Diagnostics

Number of Items Scraped

BEEP BEEP BOOP

Trello and Databot


Human Computer Loop

Humans are great feature detectors for what is broken

Computers are good at detecting that it's broken

Business has insight into spider fixes, increases communication

Increase & decrease sensitivity to generate more or less work

Going German

Quality Control

Working spiders update our DB at high frequency

Need to keep the quality of prices and stock statuses high


Analyse Streams of Items

Price Protection

Build price distributions per designer, currency, and type

Check the likelihood that a new price comes from the given distribution

Reject or update parameters in streaming fashion
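
A sketch of the three steps above under stated assumptions. The slides do not say which distribution is fitted; this example assumes a normal distribution over log prices, updated with Welford's streaming algorithm. The class, the function, the threshold and the minimum sample count are all hypothetical.

```python
import math


class PriceDistribution:
    """Running mean and variance of log prices (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, price):
        x = math.log(price)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_plausible(self, price, threshold=4.0, min_samples=5):
        if self.n < min_samples:
            return True  # too little data to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return math.log(price) == self.mean
        return abs(math.log(price) - self.mean) / std <= threshold


# One distribution per (designer, currency, type), all held in memory.
distributions = {}


def check_price(designer, currency, item_type, price):
    """Return True and update the distribution if the price is accepted,
    return False and leave the distribution untouched if rejected."""
    key = (designer, currency, item_type)
    dist = distributions.setdefault(key, PriceDistribution())
    if not dist.is_plausible(price):
        return False
    dist.update(price)
    return True


# Made-up prices for illustration.
accepted = [check_price("Acne", "GBP", "shoes", p)
            for p in [300, 320, 350, 290, 310]]
rejected = check_price("Acne", "GBP", "shoes", 3.0)   # likely a misparse
ordinary = check_price("Acne", "GBP", "shoes", 330)
```

Each distribution keeps only three numbers, which is why thousands of them fit in memory.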


~15,000 Distributions in Memory

Acne Shoes


Acne Accessories



Thank You