Fri 24 Mar 2006
This is an interesting read (pulled from my delicious links) on database optimization, and the downsides of normalization:
Normalized data is for sissies
The article links to a pdf presentation given by Cal Henderson, who helped create Flickr. A quick snippet:
In Flickr’s case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.
One Response to “Database optimization”
Leave a Reply
You must be logged in to post a comment.


March 27th, 2006 at 12:21 pm
This *is* an interesting read on database optimization: the downsides of normalization and the upsides of denormalization. It is also an interesting discussion on the downsides of denormalization and the upsides of normalization.
On the other hand, you could summarize it as an example of multiple perspectives on multiple dimensions that make visible a juncture - an intersection of functions - at which point decisions are made about what to optimize and how to optimize depending on the local configurations of hardware, software, functional requirements, personnel, and timeframes.
At this milestone - this stumbling block of a moment - we have the opportunity to step up to new insights by paying close attention while we are defining particular issues and local solution(s). Seems that somewhere between the long-term managerial caution of a normalizing software architect and the fire-drill response of a results-driven tool-wielding denormalizing programmer, an OI balance-of-the-moment will be created. And we’ll be learning whether a extreme programming pragmatic approach helps bridge the ideal and the practical, that is, theory and practice. In collaboratively joining the collecting of streamed data and the delivery of cruise data through the concreteness of strategic database design sessions, we have shared diagrams that effectively bridge that which traditionally is resident initially in memory of a few technical professionals and that disappears into undomumented local functionality over time. The existence of two OI database schema makes evident a disjuncture in function -acquisition and query - and a difference in performance criteria dealing with time - real-time versus post-study.
Seems the rich heterogeneity of experiences in the online discussion highlights a variety of issues. I’ll try to capture and order some of them below.
1. Normalization
-enforces data integrity by reducing redundancy (ensures no conflicting data)
-increases flexibility
-improves efficiency of data maintainability (data synchronization issues minimized)
-improves ability to change database structure
-the ideal is captured in 5 normal forms (NF) of which the pragmatic level is often 3NF
-indexing and stored procedures help as scaling occurs
-stored procedures can improve efficiency of response time by large pre-defined sets of queries
-views
-caching may help with read time responses to multiple simultaneous accesses
2. Denormalization
-selective denormalization is a design decision (best practice begin with normalization)
-introduces redundancy (data duplication) improves efficiency of a particular data access at the cost of introducing replication and referential integrity (makes the schema dependent on useage)
-redundancy carries higher cost and is more problematic to maintain over time
-duplicate data brings introduces need for attention to data synchronization to maintain consistency of the data
-may be achieved running a de-normalized mirror of your primary database just for searching
And that it is important to distinguish carefully ‘optimization of what’ and ‘by what measure’ as well as whether we’ve got the issues in focus. As Shaun mentioned, normalization is what’s taught because it’s the underlying nature (theoretical ideal expressed in rules or normal forms) of database design (that is spatially efficient?). With both our schema are fairly normalized. what about the other pertinent elements, ie generalizatity of content and application characteristics, not to mention the design process and performance notes?
3. Generalization
The generalization of content refers to transforming a single table by removing a column and making 2 tables. A red flag here is not wrt the db schema but wrt who wants to input and/or tend relations.
-the acquistion schema, optimized for inputs, is more general (ie atomized relations by keys; split)
informal example query: select product from measurement, mcollection where .mcollection_id = mcollection + .mcollection_id and attribute name equals chlorophyll order by time
-the query schema, optimized for use cases, is less general (ie relations grouped by tables; lumped)
informal example query: select chlorophyll from sample order by time
4. Application characteristics
-mySQL does not have stored procedures or views
-do cpu time, caching and diskspace come into play or are they unproblematic given low costs today
-other!?
5. Design process
-identify performances of concern for optimization: data capture, integrity, query types, access time, maintainability over time, change of application over time
-initiate processes of identifying and gathering the data and metadata - to have something to work with in practice
-query types can be identified through use case scenarios and frequent mini user-demos
-normalize first, then test quantitatively, and then de-normalize as needed, and continue iteratively (ie best practice: avoid premature optimization)
6. Some side notes
-performance is affected by whether the search is by primary key or free text
-performance depends on the database application and how select queries are implemented (MySQL ‘UNION ALL’ adds significant overhead and the time taken for the query increases)