This is the IMDb contributor's newsletter, published every 6-8 weeks.

To unsubscribe, send a message to data-news-unsubscribe@mlists.imdb.com. To subscribe, send a message to data-news-subscribe@mlists.imdb.com. You can also use the signup page at http://www.imdb.com/maillists .

Feedback on these articles or suggestions for new topics are welcome; contact dnews@imdb.com. The most interesting questions will be used in the next issue.

Issue #2

In this issue
-------------

- What happens when you submit data
- Cast credit order
- Historical figures in cast lists
- Episode lists
- Some comments about AKA titles
- UNIX tools 3.18 released
- Feedback

What happens when you submit data
---------------------------------

People sometimes wonder what happens to their data after they submit it. It is not placed online immediately.

Once the mail server has accepted your data, it is accumulated until about 8 AM GMT Thursday (11 PM PST Wednesday). The entire week's data is sent to the managers of the various portions of the database. Each list manager then extracts the data for the parts of the database they are responsible for. The data is sorted and duplicates are eliminated.

The list managers spend some time making sure the data is formatted correctly and checking for various inconsistencies, such as people working before their birth or after their death (this may indicate two people with the same name, but not necessarily). Some data is checked against official sources.

As the various database managers complete work on a list, they upload their information; the database is rebuilt nightly using whatever's been added that day. Some browsable sections of the database are rebuilt on a weekly cycle.

That's the normal cycle. However, when you add a new title, it has to go through additional processing.
Because people often submit titles that are not really new, or are not appropriate for inclusion, each title must be examined and approved manually, based in part on the data submitted along with the title, which is why it's important to submit as much information as possible along with a new title. This currently adds two to four weeks to the cycle; data will not appear online until the title it is associated with has been approved. In addition, new names must also be approved for similar reasons; this adds about a week's delay. If data is submitted to the wrong list (e.g., a casting assistant, which belongs in the miscellaneous crew list, submitted to the casting directors list), rerouting it adds another week or two.

While a title or name is awaiting approval, the data is kept to one side. After the title/name is approved, the data is normally included the next time the list is processed, which means it should appear within a week. Unfortunately, it does sometimes get lost if there is an unusually long delay or other problems; we are working to reduce the number of these cases.

For certain kinds of data, additional work is needed. Submissions of URLs for new sites are verified to be sure the site meets our guidelines of appropriateness (for example, sites submitted for a title must pertain to that specific title, not a company or actor). Finally, those lists with free-form text need manual copy editing for wording and duplicates. This takes varying amounts of time for the various lists, based on submission volumes and quality, along with the backlog for those lists (see the article last issue about the "TGQ" lists). A reminder that the TGQ backlog is processed in priority order; we've made excellent progress in the last 2 months.

For reasons of timeliness, some information provided by IMDb staff bypasses part of this process.
Most notably, editors collect box office data and links to reviews at some web sites; these are updated in the nightly build mentioned earlier. We also update biographies when someone notable dies, and information for certain high-profile awards will also appear online much faster. On the IMDbPro site, some of this information doesn't even have to wait for a nightly build.

Over the next year, we hope to streamline the submission process, eliminating weekly batching and making changes that should reduce the number of bad title submissions. There will also be opportunities to see and comment on data that has been submitted but not processed. This process has already begun; for example, URLs are processed daily, not weekly.

Cast credit order
-----------------

The cast of a film is one of two sections of the database that do not necessarily appear in alphabetical order (the other is the writing credits). The rule for determining this order can be confusing, since we don't necessarily list the biggest stars first.

The rule is this: the correct order for credits is that of the most comprehensive cast list, which in modern films is usually at the end. If that leaves the stars way down in the list, so be it.

We do have another system for marking principal cast members that we have not yet fully deployed; that will allow us to feature those actors on the overview page regardless of cast order. We also expect to someday flag whether the cast is billed alphabetically or in order of appearance (the two most common counter-billing orders).

Historical figures in cast lists
--------------------------------

We have many appearances for historical characters playing themselves (e.g. Richard Nixon). It's very hard to draw the line here because in some cases those credits are valid and useful to have.
Keeping Nixon as an example, some cases where the 'credit' is valid:

# "Cold War" (1998) (mini)
# Reel Radicals: The Sixties Revolution in Film (2002) (TV)
# Making of a Leader (1919-1968), The (1994) (TV)
# Houston, We've Got a Problem (1994)
# Secret Life of Richard Nixon, The (2000) (TV)

In other cases the credit is superfluous and should go. For example:

# Frequency (2000)
# Contact (1997)
# Doors, The (1991)

Even though footage of him was used in those films (and we mark his appearance as 'archive footage'), these appearances do not belong in the main cast list.

The problem is that in many cases it's hard or impossible to make the distinction unless you are familiar with the film. For example, when you see a credit like "Watergate" (1994) (mini), you don't really know if this is a legitimate documentary appearance or some fictional, fact-based program that uses footage of Nixon the same way Contact (1997) or JFK (1992) do. In some cases it can be determined by checking the data we have on the title (whether it's a documentary, whether all other credits are for professional actors or historical figures, etc.) but that requires tools/time/effort that we don't have right now.

We do reject many similar credits (many appearances by Hitler, Bill Clinton, JFK or other historical figures are rejected every week). The ones that are listed in the database managed to creep into the lists. Not all submitters share the view that we should reject those credits; some see them online, assume that this is the norm, and send more of them. All this makes it harder to reject them, especially if we are not familiar with the titles involved.

At this time we are erring on the side of accepting the credits when in doubt, and possibly removing them later when they are determined to be invalid. At some future time, we may create another way of listing such appearances that would clearly separate them from the main cast list.

Episode lists
-------------

As the coverage of television episodes has grown, some crew members have accumulated episode lists that have become unmanageably long. A more comprehensive solution to episodes is in the works, but until it arrives, we are using another approach.

Where in the past you may have submitted multiple episodes in a single entry, like this:

    Spotnitz, Frank|"X Files, The" (1993)|(episodes "Alone (2001)", "Daemonicus (2001)")

you should now submit each episode separately, like this:

    Spotnitz, Frank|"X Files, The" (1993)|(episode "Alone (2001)")
    Spotnitz, Frank|"X Files, The" (1993)|(episode "Daemonicus (2001)")

Existing entries are being converted. In some cases, episode lists were temporarily replaced with "(multiple episodes)"; these should be converted back shortly as well.

Some comments about AKA titles
------------------------------

After the last newsletter, we got some feedback about aka titles (alternative titles) in IMDb. This is a response to those remarks.

We mentioned in the last issue that IMDbPro displays USA titles where available. This includes only those titles marked (USA) with no additional attributes like (informal English title). Thus, a title that is only a translation used in a review and not an actual release title should be marked appropriately; other possibilities are (informal literal English title) and (video title).

Unfortunately, we add about 1000 aka titles each week, and are unable to investigate each one in depth. It's thus more important than ever to be sure to use the correct attributes on alternative titles. If a title should have an attribute and does not, please use CORRECT-AKA to point out the omission.

The attribute (theatrical title) should only be used for, and is only present on, TV movies, mini-series, and video titles that would not normally get a theatrical release.
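
To illustrate the rule given earlier in this article for which aka titles count as USA titles, here is a minimal Python sketch. The AkaTitle structure, the function names, and the placeholder titles are all assumptions made up for this example; they are not IMDb's tools or list format.

```python
# Hypothetical representation of an aka title: the title text plus its
# parenthesized attributes, e.g. ["(USA)", "(video title)"]. This is an
# illustration only, not the database's actual format or tooling.
from typing import List, NamedTuple, Optional


class AkaTitle(NamedTuple):
    title: str
    attributes: List[str]


def is_plain_usa_title(aka: AkaTitle) -> bool:
    """True when the only attribute is (USA), with nothing else attached."""
    return aka.attributes == ["(USA)"]


def pick_usa_title(akas: List[AkaTitle]) -> Optional[str]:
    """Return the first aka marked (USA) alone, or None if there is none."""
    for aka in akas:
        if is_plain_usa_title(aka):
            return aka.title
    return None


if __name__ == "__main__":
    # Placeholder titles, not real database entries.
    akas = [
        AkaTitle("Placeholder Review Translation", ["(USA)", "(informal English title)"]),
        AkaTitle("Placeholder Release Title", ["(USA)"]),
    ]
    print(pick_usa_title(akas))
```

In this sketch an entry carrying (USA) together with any other attribute is skipped, which is why correct attributes matter for what gets displayed.
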

There are many alternate titles with no attribute from the time before we attached attributes to alternate titles; again, if you know what the correct attribute should be, please report it with a CORRECT-AKA.

The year in an aka title should correspond to the year that title was used. Our tools will, by default, force the year in the aka to match the year in the primary title. However, if the aka title specifies a country and we have a release date in that country with a different year, the year in the aka title will be corrected to match. Years in less structured lists, such as distributor attributes, cannot be used for this purpose; the only release years that really matter are those in the release date list.

Finally, we recognize that aka titles for languages using non-Roman alphabets are not always consistent. While our title manager is fluent in four languages, there are many languages where his knowledge is minimal to zero. Correcting transliterations requires detailed knowledge about the original language, character set and transliteration rules. We do not have this knowledge for Japanese, Russian, Indian languages, etc. We depend on the knowledge of our users here. The usual ways of correcting data apply here as well. There is no satisfying solution to this problem as long as no experts are available to debug the complete set of titles for one language and enforce standards on every single title.

UNIX tools 3.18 released
------------------------

Version 3.18 of the locally installed version of the database package (moviedb) has been released. It can be found at the usual FTP sites; see http://us.imdb.com/interfaces for details.

There is one major change in this release. The previous versions could only handle 60,000 titles with votes; that limit has been removed in this version. In addition, various compile-time warnings should no longer occur.

Installation remains the same as for earlier releases.
Note that if you are using the X Windows interface, xregal, it cannot be compiled with most new releases of X. However, the changes in this release do not require recompilation of xregal, so if you have a working binary of xregal, keep using it. Alas, the author of xregal has chosen to stop supporting it, so a newer version is not available.

To rebuild:

Extract the tar file into a directory named database. Assuming you already have a copy of the database files, from ./database/ :

    make compile
    make installbin
    cd imoviedb; make; make install
    # If you need to build xregal and are able to:
    # cd ../xregal; make; make install
    cd ..
    make cleandbs
    make update-local
    ./etc/cgencompl -all    # optional

If it's not working for you, check the following things first:

. Do you have enough disk space?
. Are the source files for moviedb up to date?
. Are all the binaries in database/bin/ and database/etc/ up to date?
. Did you do *all* relevant steps above in the order listed?

For further support, contact unix@imdb.com.

Feedback
--------

Thanks to the people who commented on the first issue of the newsletter. By far the most popular questions centered around our processing cycle, which is why the lead article in this issue is an overview of that cycle. Another popular question had to do with the proper method of correcting names; a major article on that subject is planned for the next issue. Many of the other articles in this issue were also inspired by user questions.

Some other questions (summarized):

Q: Don't goofs take a lot of time to check? Are people really that interested?
A: They actually take more time to edit for readability and check for duplicates than to research, but yes, our logs show that goofs are a very popular part of the database.

Q: Do you accept submissions for animal performers?
A: We've recently changed our policy on this. If an animal performer is credited in the cast list, they can now be submitted with the regular cast list.
If an animal performer is uncredited, or if their credit is buried in the miscellaneous credits of a movie, then we do not accept them. You should make your best guess of the actual gender of the animal when determining whether to submit it to the actor or actress list.

Q: Why hasn't my miscellaneous crew submission appeared?
A: Backlog on this list was running about 4 weeks. This has recently been cleared and is now back to normal.

---------------------------------------------------------------------------
IMDb - Data Contributor's Newsletter - Issue 2 - THE END