How to create a new Data Model (by Michael Groeneman, 1/24/11)

A Data Model consists of the following four things:
1. A set of fields, detailing the attributes of the data model.
2. A scrape method that acquires raw data from a data source.
3. A parse method that parses the raw data and returns a dictionary of the form {attribute:value,...}
4. An __identifier__ variable, which specifies the identifier of the DataModel,
	which should be an identifying value unique to each instance of the particular DataModel subclass. 
	ZIP code for Weather data, or ticker symbol for Stock data are good examples.  It is strongly recommended, 
	though not required, that the scrape method be able to get correct data from the data source using only the
	identifier.

We proceed to describe each of these in turn.

--- Fields ---

Fields are classes which encapsulate metadata about an attribute.  Examples of such metadata include the human-readable name of an attribute and its minimum, maximum, and default values.  As of this writing, the following fields are available:

- StringField(name,maxLen=100,default="")
	Details an attribute of type basestring.  maxLen is inclusive.
	
- EmailField(name,default="")
	A subclass of the StringField Field, which only accepts valid email addresses, or the empty string.  Maximum length is 100 characters.
	
- IntField(name,minVal=-2147483648,maxVal=2147483647,signed=True,default=0)
	Details an attribute of type int.  The min and max values can be set as appropriate, but default to the min and max values of the PostgreSQL and Python integer types.  If signed is False, only positive numbers are accepted, even if minVal<0.

- LongField(name,minVal=-9223372036854775808,maxVal=9223372036854775807,signed=True,default=long())
	Similar to IntField, but can accept larger values, as the underlying type is long instead of int.
	
- DecimalField(name,precision=7,minVal=Decimal(-1000000000000),maxVal=Decimal(1000000000000),signed=True,default=Decimal(0))
	Stores a decimal exactly, with no rounding error.  The precision, minVal, maxVal, signed, and default arguments can all be specified.

- DateTimeField(name,
				minDate=datetime.datetime(datetime.MINYEAR,1,1),
				maxDate=datetime.datetime(datetime.MAXYEAR,12,31),
				default=datetime.datetime(datetime.MINYEAR,1,1))
	Stores a date and time.
	
- TimeField(name,default=datetime.time(0))
	Stores a time.  Defaults to midnight.
	
More fields to come!
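
To make the interaction between minVal, maxVal, and signed concrete, here is a minimal sketch.  The IntField class below is a hypothetical stand-in written only so the example runs on its own; the real class lives in the DataModel library and may differ.  The accepts method is an invented helper, not part of the documented API, and the sketch assumes that zero counts as acceptable when signed is False.

```python
class IntField(object):
    """Hypothetical stand-in for the library's IntField (illustration only)."""

    def __init__(self, name, minVal=-2147483648, maxVal=2147483647,
                 signed=True, default=0):
        self.name = name
        self.minVal = minVal
        self.maxVal = maxVal
        self.signed = signed
        self.default = default

    def accepts(self, value):
        """Invented helper: True if value fits this field's constraints."""
        if not isinstance(value, int):
            return False
        # Assumption: an unsigned field rejects negatives even if minVal < 0,
        # and treats zero as acceptable.
        if not self.signed and value < 0:
            return False
        return self.minVal <= value <= self.maxVal


zipCode = IntField("ZIP Code", minVal=0, maxVal=99999, signed=False)
unsignedCount = IntField("Count", minVal=-10, signed=False)

print(zipCode.accepts(90210))      # inside the range
print(zipCode.accepts(100000))     # above maxVal
print(unsignedCount.accepts(-5))   # negative, although minVal is -10
```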

--- Identifier ---

The identifier is a (preferably [or perhaps absolutely?] unique) field that is used by the scraper to get the raw data when updating, and for a few other things.  Its value MUST be set when the DataModel instance is constructed.  The __identifier__ variable is a string that corresponds to the variable name of a field instance.  For example, if we have the field
	
	zipCode = IntField("ZIP Code",minVal=0,maxVal=99999,signed=False)

then we could have the following identifier assignment:
	
	__identifier__ = "zipCode"
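
As a rough illustration of how the string in __identifier__ points at a field, the sketch below looks the field up by name with getattr.  The IntField class is a bare stand-in, the Weather class would subclass DataModel in real code, and the temperature field is invented for the example; none of this is taken from the library's actual implementation.

```python
class IntField(object):
    """Hypothetical stand-in for the library's IntField (illustration only)."""

    def __init__(self, name, minVal=-2147483648, maxVal=2147483647,
                 signed=True, default=0):
        self.name = name
        self.minVal = minVal
        self.maxVal = maxVal
        self.signed = signed
        self.default = default


class Weather(object):  # a real data model would subclass DataModel
    zipCode = IntField("ZIP Code", minVal=0, maxVal=99999, signed=False)
    temperature = IntField("Temperature", minVal=-150, maxVal=150)  # invented field

    # The string must match the variable name of one of the fields above.
    __identifier__ = "zipCode"


# Resolve the identifier string to the field instance it names.
identifierField = getattr(Weather, Weather.__identifier__)
print(identifierField.name)
```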

--- Scrape ---

The scrape method takes no arguments, and returns raw data provided by the data source.  Ideally, the scrape method should use only the identifier, because other fields may not be set before the first scrape, or may be set to defaults that would not get the correct raw data if passed to the data source.  Ignore this at your own peril.  The identifier is the ONLY attribute that is required to have been set at construction time.

Scrapers are also advised to prepare for the possibility of an invalid identifier.  While the DataModel infrastructure guarantees that the identifier will be of the correct type (and performs some basic checks), it doesn't have a 100% reliable way to make sure the identifier is valid.  For example, "XXXX" is in the valid format of a stock ticker, but there isn't a stock represented by that ticker symbol.  Know what your data source will do in this situation, and be prepared to respond by raising UnknownIdentifier when the situation occurs.  To help you with this, consider overriding the _validIdentifier method, which takes a candidate identifier and returns True only if the candidate is valid.  For example, the _validIdentifier method for the Stock data model, shown below, makes sure the ticker symbol contains only alphanumeric characters, periods, and dashes:

	# Requires "import re" at the top of the module.
	def _validIdentifier(self,ticker):
		''' Ensures ticker symbol only contains valid characters (A-Z,a-z,0-9,.,-)
			http://stackoverflow.com/questions/1323364/in-python-how-to-check-if-a-string-only-contains-certain-characters
		'''
		if isinstance(ticker,basestring):
			search = re.compile(r'[^a-zA-Z0-9.-]').search(ticker)
			return not bool(search)
		else:
			return False
			
Of course, this doesn't guarantee that the ticker symbol actually exists, but it prevents obviously bad things from happening.
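
A scrape method that is prepared for a well-formed but nonexistent identifier might look like the sketch below.  Everything here is a stand-in so that the example runs on its own: UnknownIdentifier is redefined locally rather than imported from the library, Stock does not subclass DataModel, and FAKE_SOURCE is a made-up in-memory dictionary that takes the place of a real data source.

```python
class UnknownIdentifier(Exception):
    """Local stand-in for the library's UnknownIdentifier exception."""


# Made-up data standing in for a remote data source (not real quotes).
FAKE_SOURCE = {
    "GOOG": "GOOG,100.00",
    "IBM": "IBM,50.00",
}


class Stock(object):  # a real data model would subclass DataModel
    __identifier__ = "ticker"

    def __init__(self, ticker):
        # The identifier is the only attribute guaranteed to be set.
        self.ticker = ticker

    def scrape(self):
        """Return raw data for this ticker, using only the identifier."""
        raw = FAKE_SOURCE.get(self.ticker)
        if raw is None:
            # The source has nothing for this ticker: report it explicitly
            # instead of handing empty or misleading data to parse.
            raise UnknownIdentifier(self.ticker)
        return raw


print(Stock("GOOG").scrape())
try:
    Stock("XXXX").scrape()
except UnknownIdentifier as err:
    print("unknown identifier: %s" % err)
```

How a real data source signals a missing record (an empty response, an error page, an HTTP status) varies, so the check for None above would be replaced by whatever test fits that source.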

--- Parse ---

The parse method takes one argument, which is the raw data collected by the scrape method.  Parse is expected to return a dictionary of the form {attributeName:attributeValue}, where attributeName is a string representing the name of one of the fields, and attributeValue is of the type specified in the field.  Note that no type casting is performed by the DataModel method that calls parse, so any necessary casting must take place in parse.  A lot of tedious busywork takes place here, so that it doesn't have to be done later.
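
The casting described above can be sketched as follows.  The raw format (one comma-separated line of ticker, price, volume, and timestamp) and the field names price, volume, and lastTrade are assumptions made up for this example, and the Stock class is a stand-in that does not subclass DataModel.

```python
import datetime
from decimal import Decimal


class Stock(object):  # a real data model would subclass DataModel
    def parse(self, rawData):
        """Turn one raw line into {attributeName: attributeValue}."""
        # Assumed raw format: "ticker,price,volume,YYYY-MM-DD HH:MM"
        ticker, price, volume, stamp = rawData.strip().split(",")
        return {
            "ticker": ticker,                 # already a string
            "price": Decimal(price),          # exact decimal, not float
            "volume": int(volume),
            "lastTrade": datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M"),
        }


parsed = Stock().parse("GOOG,100.50,12345,2011-01-24 16:00\n")
print(parsed["price"] + Decimal("0.25"))
```

Doing every cast inside parse means each value already has the type its field expects by the time the calling code sees it.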

**** Other Things of Note ****

- The .sql method: The DataModel class has a .sql() method that returns the (Postgre)SQL needed to generate a table for the data model.  Sysadmins can use this to generate the database tables needed to support a particular DataModel.