Name Standardization

From Sunlabs wiki

Names of entities (donors, members of Congress, corporations, even governments) are not written the same way across documents or databases, or even within a single document. In the Federal Election Commission data files, for instance, the same donor can appear as William Smith, Billy Smith, Billy Smith, JR., or any number of other variants. Corporations go beyond this because several corporate names can attach to the same person or business: Lorne Michaels is not only the executive producer of Saturday Night Live, but also the CEO of Broadway Video and an employee of NBC Studios, a subsidiary of General Electric. Members of Congress are also referred to by different names in different filings and databases; Sunlight has partially solved this problem with the search function in its API. Because these names are generally not standardized, it is difficult to compare names across databases without human intervention.


Name Standardization Problems

Members of Congress

Members of Congress go by different names and are referred to differently in different documents. Fortunately, this problem is easily solved with good search. Sunlight Labs has solved it for the 110th Congress with the Sunlight Labs API and will maintain the solution beyond the 110th. The problem is solved mostly by a search algorithm that assumes filings will either get the member's initials correct or refer to the member as Senator or Representative. The efficacy of this approach is demonstrated on the Sunlight Labs Blog through a simple Google Spreadsheet mash-up.
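The initials-and-title assumption described above can be illustrated with a small Python sketch. This is not the Sunlight Labs API or its actual search algorithm; the roster, the `TITLES` table, and the `match_member` function are all hypothetical names made up for illustration.

```python
import re

# Hypothetical roster; a real one would come from an authoritative member list.
MEMBERS = [
    {"first": "William", "last": "Smith", "title": "Sen"},
    {"first": "Barbara", "last": "Smith", "title": "Rep"},
    {"first": "Jane", "last": "Doe", "title": "Rep"},
]

# Title words a filing might use, mapped to a chamber code (assumed list).
TITLES = {
    "senator": "Sen", "sen": "Sen",
    "representative": "Rep", "rep": "Rep",
}

def match_member(raw, members=MEMBERS):
    """Return roster entries consistent with a free-text reference.

    Assumes the last word is the surname and, if a first name or initial
    is present, that its first letter is correct.
    """
    tokens = re.findall(r"[a-z]+", raw.lower())
    title = None
    if tokens and tokens[0] in TITLES:
        title = TITLES[tokens[0]]
        tokens = tokens[1:]
    if not tokens:
        return []
    last = tokens[-1]
    first_initial = tokens[0][0] if len(tokens) > 1 else None
    hits = []
    for m in members:
        if m["last"].lower() != last:
            continue
        if title and m["title"] != title:
            continue
        if first_initial and m["first"][0].lower() != first_initial:
            continue
        hits.append(m)
    return hits
```

With this roster, a bare "Smith" stays ambiguous (two candidates), while "Sen. W. Smith" or "Representative Smith" each narrow to one. A reference with a nickname whose initial differs from the formal first name would be missed, which is exactly the limit of the initials assumption.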

Lobbyists

Lobbyists tend to have more common names, which makes them a more difficult problem to solve. With over 15,000 lobbyists and both chambers of Congress releasing data, combining and standardizing the names between the two databases (Senate and House) is difficult. Sunlight Labs is presently working on this problem and hopes to complete lobbyist standardization by the end of Q4 2008 by applying the same methods it used for Congressional name standardization.

Corporations

Corporate name standardization is undoubtedly the hardest: names are often reported differently within the same database, subsidiary information is hard to map, and misspellings and alternate representations of the same entity are very common. Wal*Mart is a classic example, appearing as Wal Mart, Wal*Mart, WalMart, Walmart Stores, Wal*Mart Stores, Inc., Walmart, Inc., and other combinations, all within the same databases. This problem is presently untackled and needs support.
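To make the scale of the problem concrete, here is a minimal Python sketch that groups the Wal*Mart variants listed above by a crude key (lowercase, letters and digits only). The `squash` helper is a hypothetical name and is not a method anyone is known to use for this data; it only shows how far punctuation and spacing cleanup gets on its own.

```python
import re
from collections import defaultdict

def squash(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

variants = [
    "Wal Mart", "Wal*Mart", "WalMart", "Walmart Stores",
    "Wal*Mart Stores, Inc.", "Walmart, Inc.",
]

# Group the raw spellings under their squashed key.
groups = defaultdict(list)
for v in variants:
    groups[squash(v)].append(v)

for key, names in groups.items():
    print(key, names)
```

Only the first three spellings collapse to one key; the variants carrying "Stores" or "Inc." still land in separate groups, so word-level rules are needed on top of character cleanup.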

Contributors

The standardization of individuals is useful up to a point. A generic tool that judges whether two entity names likely refer to the same entity, based on a set of characteristics, would be useful, but knowing that two individuals are the same is only useful in certain cases: for instance, when someone makes a lot of campaign contributions, is also a lobbyist or the spouse of a lobbyist or member of Congress, or sits on the board of, or is an executive of, a corporation.
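One way such a generic tool might look is sketched below in Python: it combines string similarity on the name with agreement on other characteristics. The record fields (`zip`, `employer`), the weights, and the threshold are all assumptions chosen for illustration, not values from any tested system. `difflib.SequenceMatcher` is from the Python standard library.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same(rec_a, rec_b, threshold=0.75):
    """Return (decision, score) for whether two records match.

    The weights (0.6 name, 0.2 ZIP, 0.2 employer) are arbitrary
    placeholders and would need tuning against real data.
    """
    score = 0.6 * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a.get("zip") and rec_a.get("zip") == rec_b.get("zip"):
        score += 0.2
    if rec_a.get("employer") and rec_a.get("employer") == rec_b.get("employer"):
        score += 0.2
    return score >= threshold, score

a = {"name": "William Smith", "zip": "20001", "employer": "Acme"}
b = {"name": "Billy Smith", "zip": "20001", "employer": "Acme"}
c = {"name": "Billy Smith"}

print(likely_same(a, b))  # name plus shared ZIP and employer
print(likely_same(a, c))  # name alone
```

In this sketch the name similarity alone never reaches the threshold, so a match always needs at least one corroborating characteristic. That is a design choice for common names like these, not a requirement of the technique.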

Techniques

  • Just adopt Wikipedia's name for everything, along with its conventions as they evolve, and refuse to work with or officially sanction any other list of names or conventions. Eventually Wikipedia's will just win out.
  • Nickname table
    • The http://dkosopedia.com list of tags, for instance, maps every tag used on http://dailykos.com to a particular article name on dkosopedia. Those article names are standardized to the same names and conventions used in Wikipedia, by far the most desirable and standard list of names of anything, because it is translated, heavily scrutinized for neutrality, and can be improved or updated by anyone. Using redirects for the many alternative names is standard and common there.
  • Rules (like "drop inc.")
    • Exceptions (always add/drop "inc." for certain names)
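The nickname table, the "drop inc." rule, and the exception list from the techniques above can be sketched together in Python. Every table entry below is a hypothetical placeholder, as are the function names; a real deployment would need curated tables.

```python
import re

# Hypothetical nickname table: informal first name -> canonical first name.
NICKNAMES = {"billy": "william", "bill": "william", "bob": "robert"}

# Rule: drop these corporate suffixes when they trail the name (assumed list).
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "llc", "ltd"}

# Exceptions: names (already lowercased, punctuation removed) whose suffix
# is kept as part of the identity. Placeholder entry only.
KEEP_SUFFIX = {"example holdings inc"}

def tokens(name):
    """Split a name into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())

def standardize_org(name):
    """Apply the suffix-dropping rule unless the name is a listed exception."""
    toks = tokens(name)
    if " ".join(toks) in KEEP_SUFFIX:
        return " ".join(toks)
    while toks and toks[-1] in SUFFIXES:
        toks.pop()
    return " ".join(toks)

def standardize_person(name):
    """Replace a leading nickname with its canonical form, if one is known."""
    toks = tokens(name)
    if toks:
        toks[0] = NICKNAMES.get(toks[0], toks[0])
    return " ".join(toks)
```

Under these tables, "Billy Smith" and "William Smith" both standardize to the same string, and "Walmart, Inc." loses its suffix while the listed exception keeps it. Checking exceptions before applying rules keeps the rule set simple and pushes the awkward cases into data rather than code.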

CRP's rule:

  • Law firms: strip off all partner names after the second and say "et al."
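The law-firm rule above might be implemented along these lines. This is a sketch only: the way partner names are split (commas, ampersands, the word "and") and the exact output punctuation are assumptions, not CRP's documented behavior, and `abbreviate_law_firm` is a hypothetical name.

```python
import re

def abbreviate_law_firm(name):
    """Keep the first two partner names and replace the rest with 'et al.'

    Names with two or fewer partners are returned unchanged (trimmed).
    """
    parts = re.split(r",|&|\band\b", name, flags=re.IGNORECASE)
    partners = [p.strip() for p in parts if p.strip()]
    if len(partners) <= 2:
        return name.strip()
    return f"{partners[0]}, {partners[1]} et al."

print(abbreviate_law_firm("Smith, Jones, Brown & Davis"))
print(abbreviate_law_firm("Smith & Jones"))
```

Note that a trailing designation such as "LLP" is treated as part of the last partner's name here, so it disappears along with the partners beyond the second.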

Untested stuff that might be worth trying:
