Phase 2
Page outline
Introduction
In the first phase you have gotten a first impression of the problems encountered when trying to extract useful information from formats that were not designed for this purpose. Clearly the standard HTML format is not at all suitable for automatic processing, which is one of the reasons why standards like XHTML have been designed. Of course the fact that many web shops don't want their data to be automatically gathered doesn't help either.
Now that you all have data sets in a "normalized" form (actually three data sets in three different forms), you get to know another major challenge in information systems -- merging inhomogeneous data. As you can imagine this is a very frequent problem in information technology today, just think of corporate mergers, distributed information systems, etc. Whenever data from different sources come together, a way must be found to recognize which records actually describe the same data, but are just in a different format. Unique identifiers like the ISBN for books are unfortunately the exception, so different solutions must be found.
The first part of the second phase of the project is all about this problem. Each team has three different databases at hand, their own and those from the other two teams in the consortium. The goal is to merge these three databases to a single, more comprehensive one in order to have more interesting data for the "web shop" that you are going to design in the second part.
Step 1: Exchanging data
Before you can start with the real work of phase 2 all teams must exchange their databases within their consortia. How you do this is completely up to you. The only rules we make is that you must give your consortium partners access to your original and complete database. This includes the database schema and the actual records in an SQL format suitable for the Postgres servers we are using for the course.
You'll probably want to use the pg_dump tool for this. It lets you extract the entire database in one swoop and save it in a file. Here is an example of how you can dump your database to a file:
pg_dump -Oxd MyDatabase -f mydatabase_dump.sql
For more information you can consult the pg_dump reference or enter pg_dump --help.
Deadline for database exchange: Wednesday, 2005-04-27, 2400
Step 2: Importing the data
Now that you have received the databases dumps from the other two teams you should probably start by reimporting them on your server. The best way to do this is to first create two databases and then importing one dump in each of them.
Step 3: Analyzing the data
Analyze the schemas (possibly using the information found in the reports of the other teams -- see the report list) and decide what information you want to add to your own database. Think ahead what kind of information you're going to need for your "web shop". One obvious example are prices; if you have chosen music databases for your category you probably won't have any pricing information, so you can get this from one of the other databases. Try to identify the fields that can help you link the data from two databases.
It is likely that you have to do certain changes to the databases of your fellow students. You may want to restructure the tables to make them more similar to your own schema.
Step 4: Extending your schema
Obviously the schema developed in phase 1 has primarily been designed for the specific needs of your chosen category. Now that you know what kind of data you want to add you are able to extend your schema to suit the new circumstances. There are several ways to accomplish this, here are just two possibilities:
Alter your current schema by adding the required fields and tables. By doing this you lose of course your original schema but the schema becomes more homogeneous and easier to query.
Leave your current schema as it is and add new tables for the data you use introducing linking tables where necessary. You can then create views using the CREATE VIEW SQL command. The advantages of this method are obvious -- there's no need to change existing work, you can simply add new things, which is often much easier to do. Or to quote the Postgres manual: Making liberal use of views is a key aspect of good SQL database design.
If you have a lot of data (i.e. tens or hundreds of thousands), you may want to read this very nice article called Materialized Views in PostgreSQL. It uses some advanced techniques to augment performance of views on large data sets. A very interesting read if you like database technologies!
Clearly mixtures of the two approaches are possible and may make sense in certain cases. The main point of this step (and this is also the criterion we're going to use to grade your work) is that your choice makes sense in the specific circumstances you find yourselves in. Your solution should take into account the basic design principles of relational database design.
A technical note: Once you've analyzed the two foreign databases you might consider reading all three data sets into a single database for the merging process. If you have colliding table names in the three databases you may find Postgres' schema concept useful.
Step 5: Merging the data
There exist many ways to merge the data, once again depending on the specific data sets you have. Here are few general remarks that you may find useful:
Sometimes it is necessary to clean up the data before merging it. This may include removal of characters, replacement of partial strings, case conversion, etc. As an example take a CD named "Devils & Dust" in the first database and "Devils and dust" in the second one. Clearly if you try to link the two tables using a standard SQL join it won't work.
If you decide to do string processing as mentioned above a good way to do this is to add an extra field that first contains an exact copy of the original field (e.g. Title could be copied to ProcTitle -- up to you to find a better name ;-). The string processing can then only be done on the copy. This makes sure that you don't harm any original data. How you introduce the copy is again up to you, basically the same possibilities apply as for step 4.
String processing can be carried out in many ways. You could write a small Java application that uses JDBC for this purpose or you can make use of Postgres' build-in functions for string matching and processing. Here are some useful links: SQL String Functions and Operators, Other String Functions, Pattern Matching
As an alternative to preprocessing the strings and then joining them using SQL you might also consider writing a program in Java or any other language you prefer that does the actual merging process. Really there is no one way to do it.
As a general rule of thumb for the merging part there is no perfect solution but it is more of a "best effort" kind of approach. You should really try to get as much data as you can. This makes it much more interesting for the rest of the project, plus it increases your chances in the bidding game at the end. The more articles you can offer the more attractive your site becomes. Note that there is no upper limit to the number of articles that you can offer!
A note on currencies
The articles that you have gathered in phase 1 may have prices in different currencies like CHF, EUR, USD, etc. To simplify things and to make sure that prices remain comparable for the remaining parts of the project you are required to convert all prices from foreign currencies to CHF.
You must use the following exchange rates of 2005-04-18:
| Foreign currency | CHF |
| 1 EUR | 1.5459 |
| 1 GBP | 2.2602 |
| 1 USD | 1.1870 |
If you can't find your currency in this list, please let us know as soon as possible.
While we are using fixed rates for the purpose of this project, take a moment to think about how you would attack such a problem in real life with exchange rates varying from one day to another.
If your team is affected by the currency problem, the way how you solve it will be evaluated. There is no need no implement something very fancy (especially given our fixed rates), nevertheless some solutions are more suitable than others.
If you happen to be in a consortium of teams with equal currencies (or did not encounter the currency problem for some other reason that you should describe) explain in a few sentences how you would have solved the problem.
There is a chance you may not have prices for all articles at the end of the merging process. For example you may have collected much more data than the other two consortia members or you may not be able to match as many records as you would like. If this happens you are encouraged to make up prices where necessary. Try to find a creative way to do this that gives you reasonable prices. The elegance of your solution will be graded, so you should avoid simply replacing all missing values by 10 CHF. Also keep in mind that you can use these prices as a minimum selling price (i.e. the start price of a bid) for the last phase.
Step 6: Creating the Web shop
Once you have a solid database with a number of interesting articles you need to develop a web application that enables the other teams to browse through your offering and search for specific articles. Here are the two functions that your web shop must offer for now:
Browsing for articles: It must be possible to browse for music by at least one criterion. Examples for this would be alphabetic by artist or by genre. Maybe you can think of other criteria, use your imagination ...
Searching for articles: It must be possible to search for music by entering (case-insensitive) search phrases for artist and album title. As an example the search for "persuasion" could yield the album "Elements Of Persuasion" by James LaBrie.
Of course you are free to add more features to it if you're after bonus points. Here are a few ideas if you're motivated but don't know where to get started:
- Phonetic search (see the section on fuzzy string matching in the Postgres how-to)
- Advanced search options (e.g. search by price range or other attributes)
- Search refinement (i.e. adding more criteria to refine the currently displayed search results)
You will have to add one particular feature later in phase 3, namely the possibility for visitors to bid for articles on the spot. There is nothing you have to do about this during this step, nevertheless you might want to keep it in mind while you're designing your application.
To accomplish this task you will be using the Java application server J2EE and JavaServer Pages. All you need to know on setting up and using the environment can be found in the J2EE how-to which you should read carefully to get the hang of it. Note that the time you invest to learn about J2EE is not only an investment in the future of this project but also in your professional future. First of all, you will need another aspect of J2EE in phase 3 and, second of all, many commercial applications are based on J2EE, so it can never hurt to know it. :-)
A remark about XHTML validity: It can be a little tricky to ensure that all possible dynamic pages of your web shop are 100% valid XHTML, which is why you do not need to do so as a requirement for phase 2. However you will receive a bonus if your web shop is valid XHTML as well! The effort needed for this is relatively small so it might be worth it spending a few thoughts on that. If you do it, please mention it in the report and explain what exactly you did. (Note that this has nothing to do with your report page, which still has to be XHTML valid as a strict requirement. More information below.)
Report
Like for the last phase you must hand in a report at the end of phase 2. This report must be included in the archive you hand-in.
Format
The report web pages must be valid XHTML according to the XHTML 1.0 standard or higher. The choice of a DTD (Strict, Transitional, Frameset) is up to you. Moreover your web pages must be reasonably viewable in browsers that follow the current standards (that obviously does not include Internet Explorer :-).
Please put all your files in an archive (ZIP, RAR, TAR, ...) called groupXX.{zip,rar,tar.gz,...} before you send them to us. (If you don't know how to do this, try something like tar cfz group87.tar.gz phase2/.)
Content
Your report must document all relevant steps of phase 2 in reasonable detail. You should explain your decisions and highlight the important features of your work. The report represents your entire work for phase 2 and your grade will depend on it to a large extent.
Don't hesitate to mention outstanding features, we will honor additional effort. If you have encountered any particular problems during the process you should also mention them. Note that we will take into account the circumstances like the quality of the other two databases of your consortium or general difficulties of the process.
Given the difficulties during the hand-in of phase 1, here's the list of items you must include:
- Complete SQL schemas of your database
- Complete data dumps (in a format readable by psql) of your database (not including the other two databases of your consortium)
- Reasonably detailed explanations of your decisions
- Any scripts (SQL, Java, ...) that you have used in the merging process of the databases
- An explanation of how you have dealt with the currency problem
- Details on how you dealt with missing price information
- Full source code of your web shop
- Links to or binaries of any additional libraries you may use (Only exception: Please don't include the Postgres driver itself!)
- Anything else needed to compile and deploy your J2EE applications (build.xml, etc.)
- The exact and correct address of your web shop
- An XHTML compliant version of your report. There will be no exceptions this time! If the report you hand in is not valid XHTML you will get a deduction of 0.5 points. Also make sure that it's readable offline!
- Anything else that you think may be useful for grading
You must put everything listed above in an archive before you hand it in. If the size of your archive exceeds 5 MB, please make it available for download somewhere (e.g. on your web page or directly on the lsir-cis-pc account) and send us only the link/location.
Please note that this time we will not check the archives you hand in for completeness at reception. Instead, if something is missing and we cannot correct certain parts, you will simply receive less or no points at all for the corresponding criteria. So remember to reread this page thoroughly and double-check the archives you hand-in against the above list!
Please also include a link to your web shop in the e-mail you send us so we can add it to the Web shop list.
Particularities
Once again you are required to secure your infrastructure. You must therefore change your J2EE administrator password so that other people cannot login. Failure to do so will lead to a deduction of your grade by 0.5 points. So if you haven't done so by the time you read this paragraph, now is the best time to make sure you don't forget.
Deadline for report hand-in: Thursday, 2005-05-19, 2400 (hand-in by e-mail to Martin)
Copyright © Martin Rubli & Patrik Bless –
Last change:
This page uses
valid XHTML 1.0 Strict and
valid Cascading Style Sheets, Level 2.
This page uses
valid XHTML 1.0 Strict.
This page uses
valid Cascading Style Sheets, Level 2.