Green Grades: Can Information Save the Earth?

Graham Bullock

Print publication date: 2017

Print ISBN-13: 9780262036429

Published to MIT Press Scholarship Online: May 2018

DOI: 10.7551/mitpress/9780262036429.001.0001



Measuring Green: The Generation of the Information


(p.103) 4 Measuring Green: The Generation of the Information
Green Grades

Graham Bullock

The MIT Press

Abstract and Keywords

Chapter 4 uses the differences between LEED, Green Globes, the Living Building Challenge, and ENERGY STAR to highlight the methodological issues associated with information-based governance. The chapter introduces the concepts of replicability, reliability, and validity, and applies them in an analysis of the transparency and quality of the data and methods used in existing information-based initiatives. The importance of life cycle approaches to developing valid environmental information about products and companies is also discussed. With a few important exceptions, data from 245 cases of these initiatives demonstrate their general lack of methodological transparency and validity and highlight the challenges associated with developing robust metrics of sustainability. The chapter discusses several important tradeoffs between different dimensions of validity, and suggests several strategies for managing these tradeoffs. It also identifies the most promising and problematic information generation practices found in the database and the lessons learned from these examples.

Keywords:   Buildings, ENERGY STAR, LEED, Validity, Reliability, Transparency, Life Cycle Analysis, Academia

Which Is the Greenest Building of Them All?

Like Mark choosing milk products and Carrie selecting tissue paper, Lynn is faced with a decision. This decision is not for her alone, however, but for her entire institution. As I mentioned in chapter 1, Lynn is an environmental scientist at a university that is in the process of designing a new academic building. She is in the midst of a planning meeting with the architects, and they are discussing what green attributes they want the building to have. Some of her colleagues insist that the new building be LEED certified, while others dismiss LEED as arbitrary, nonsensical, and too expensive. Some are arguing for other certifications, such as the Living Building Challenge, Green Globes, and ENERGY STAR. They all turn to Lynn for her opinion, and ask her which system she thinks provides the most valid metric of sustainability performance.

Lynn has done her homework, and what she has read about LEED has left her with decidedly mixed feelings. LEED (short for Leadership in Energy and Environmental Design) was launched in 1998 by the U.S. Green Building Council (USGBC), and claims to have certified over 72,000 projects comprising over thirteen billion square feet across more than 150 countries and territories (as of February 2016).1 The USGBC claims that eighty-eight of the Fortune 100 companies use LEED and “nearly five million people experience a LEED building every day.”2 It has trained and certified nearly two hundred thousand LEED professionals who have the authority to perform audits of buildings applying for certification.3

LEED’s 2014 version includes a broad range of criteria across nine categories that range from indoor environmental quality to water efficiency to the sustainability of the building site, and addresses many of the (p.104) public environmental concerns discussed in chapter 2. The USGBC website emphasizes the cost savings and business benefits of LEED certification as well. As an example, it cites a study that shows customers opened up more accounts and deposited more money at LEED-certified bank branches than noncertified branches.4 It also claims to be a broad-based organization, with seventy-six chapters and 12,870 member organizations, which include “builders and environmentalists, corporations and nonprofits, teachers and students, lawmakers and citizens.”5

Despite these impressive statistics, Lynn is also aware of criticisms of LEED’s rating system. One of the most common complaints is about LEED’s credit for bike racks, which critics view as either not worthy of such recognition, too easy to achieve, or not expansive enough (e.g., it doesn’t include credit for giving employees bikes or locks).6 Other critiques of LEED focus on the credits it gives for energy savings projected in computer models but not actually measured, its recognition for preferred parking for fuel-efficient cars that ultimately are used by SUVs and sports cars, and its certification of enormous homes that are located in remote, pedestrian-unfriendly neighborhoods.7 A USA TODAY study found that building designers focus on the easiest and cheapest credits, and a study by John Scofield at Oberlin College concluded that LEED-certified buildings are no more energy efficient than comparable noncertified ones.8 Energy efficiency expert Henry Gifford classifies the criticisms of LEED into four categories—the “Sin of Not Following Through,” the “Sin of Valuing Gizmos over Appropriate Design,” the “Sin of Laughably Inappropriate Use,” and the “Sin of Wretched Excess.”9

Several organizations have capitalized on these perceived weaknesses of LEED to launch and promote their own competing building rating systems. Green Globes is a project of the Green Building Initiative (GBI) that is positioning itself as a flexible, practical, innovative, and credible alternative to LEED. The GBI website emphasizes that it is a multistakeholder initiative that has developed Green Globes through “a public, collaborative, consensus-based process.” Like LEED, it also highlights the private benefits of certification, stating, for example, that it helps organizations qualify for tax incentives, meet government regulations, and attract and retain employees.10 Green Globes appears to cover a similar set of environmental categories as LEED, although critics assert that it is less rigorous and more friendly to industry.11 For example, they point out that while LEED only (p.105) recognizes FSC-certified building products, Green Globes also gives credit for the purportedly industry-friendly certifications from SFI, the American Tree Farm System, and the Canadian Standards Association.12 Nevertheless, in 2013 the U.S. General Services Administration recommended that federal agencies should use either LEED or Green Globes as certification systems for their buildings.13

Rather than compete directly with LEED as a mainstream, broad-based certification, other initiatives are pursuing more focused strategies. For example, ENERGY STAR, a federal program implemented by EPA in partnership with the U.S. Department of Energy (DOE), certifies both commercial and residential buildings for their energy efficiency, but does not include any of the criteria related to indoor air quality, water efficiency, materials use, and other issues covered by both LEED and Green Globes. Thus it is not as comprehensive and may be guilty of the sin of the hidden trade-off (discussed in chapter 2), but it may be more rigorous and valid. As some commentators have pointed out, while LEED provides no guarantee that a building is actually “energy efficient,” ENERGY STAR requires buildings to submit their utility bills before they are certified and on a continuing basis.14 Other programs, including the National Green Building Standard, Passive House, and the Home Energy Rating System, have focused on energy use in the residential market.

The Living Building Challenge (LBC), on the other hand, is positioning itself as the most rigorous, holistic, and comprehensive certification system available. A project of the International Living Future Institute, the Living Building Challenge aims to help buildings “move beyond merely being less bad and to become truly regenerative.” Its categories, or “Petals,” are decidedly more far reaching than the other programs, and not only encompass water, materials, and energy use but also health, happiness, equity, and beauty. They consist of twenty required “Imperatives” that are based on metrics of actual performance and include requirements such as “net positive” energy and water use (e.g., 105 percent of the project’s energy needs must be supplied by on-site renewable energy on a net annual basis, without any on-site combustion).15

In the face of this competition, the USGBC has not been sitting still, but has been actively responding to its critics. Many of the criticisms leveled against it have been at least partially addressed in its 2014 version of LEED (v4), which includes, for example, a stronger focus on building (p.106) performance related to materials, indoor air quality, and water efficiency.16 Given these competing claims, what should Lynn recommend to her colleagues? To some extent, this question depends on her sense of the trustworthiness of these organizations (the subject of chapter 3), as many of the criticisms of these organizations relate to their legitimacy and accountability. But it also depends on the basic validity of their methods, and at some point we have to move beyond the trust issue and actually evaluate, compare, and decide among the underlying systems used by these initiatives. And if Lynn, an environmental scientist concerned about sustainability issues, does not make such an assessment, then who will? But for her or anyone else to proceed, they need a framework for making such a complex comparative evaluation.

That is the focus of this chapter, which introduces the concepts of replicability, reliability, and validity as useful tools for people like Lynn to evaluate competing eco-labels and sustainability ratings. The chapter then applies these concepts in an analysis of the transparency and quality of the data and methods used in existing information-based environmental governance initiatives. The chapter explores the different types of transparency that these programs should have, and highlights the importance of updating the data and methods used in their assessments, documenting the weights of the different indicators that make up their composite metrics, and applying the insights of life cycle analysis in their design.

With a few important exceptions, the data from my Environmental Evaluations of Products and Companies (EEPAC) Dataset—which covers cases not only from the building sector, such as LEED and ENERGY STAR, but from many other sectors as well—demonstrate the general lack of methodological replicability, reliability, and validity among existing information-based environmental governance initiatives. The dataset highlights the significant challenges associated with developing robust metrics of sustainability. The chapter discusses several important trade-offs between different dimensions of information quality, and suggests several strategies for managing these trade-offs. It also identifies the most promising and problematic information generation practices found in the dataset and the lessons learned from these examples. The chapter concludes by applying the insights of this analysis to Lynn’s predicament and the question of which building assessment program is indeed the greenest of them all.

(p.107) An Information Quality Framework: Validity, Reliability, and Replicability

A brief metaphor will help reveal the three major questions we must ask when assessing the information quality of different ratings and certifications. When Snow White’s stepmother asks her Magic Mirror who is the fairest of them all, he responds, “You, my queen, are fair; it is true. But Snow-White is a thousand times fairer than you.”17 This response raises several important questions. First, how replicable is his assessment, meaning, how repeatable is his analysis of “fairness” by others? Second, how reliable is his assessment, and how consistently does he measure “fairness?” And third, how valid is his assessment? That is, how accurately has he assessed “fairness?” If the Mirror cannot address these questions about his evaluation methods, then the Queen has every reason to doubt the veracity of his claim.

The Statistician Should Have No Clothes: Replicability

The Magic Mirror metaphor thus introduces us to the three key dimensions of information quality discussed in this chapter—replicability, reliability, and validity. These three dimensions are presented in figure 4.1, along with several related concepts that I will discuss in more detail. Replicability means that results of an analysis can be reproduced. In order for this to be possible, the measurement process must be transparent so that it can be repeated (i.e., replicated). The importance of such transparency has been emphasized across a wide range of governance contexts in the last few decades, from “sunshine” laws requiring government agencies to make their meetings open to the public to calls for corporations to disclose their environmental and social performance.18 Scholars have also emphasized the value of transparency, both in the context of their own work (e.g., by requiring the public dissemination of datasets and metadata along with the publication of articles) and the activities of industry, civil society, and government.19 However, they have seldom distinguished between different forms of transparency.20 In the context of information-based governance initiatives, these programs need to disclose not only their criteria and data sources, but also the data itself and the specific methods they use to analyze it. Thus if Lynn wants to assess the replicability of a certification such as Green Globes, she should identify whether it is transparent about each of these different aspects of its evaluation process. (p.108)


Figure 4.1 Dimensions of information quality.

Consistency in Claims: Reliability

Even if a certification is transparent and replicable, however, it may still have significant methodological limitations. This brings us to our next question—how consistent are the results of the evaluation process? This is the essence of reliability—that the metric generates similar results each time it is used to measure something. Researchers are generally concerned about two types of such reliability. Test-retest reliability refers to the consistency of a metric over time, while inter-rater reliability relates to whether similar results are generated by a metric when it is applied by different people.21 Thus Lynn might like to know that evaluators for the Living Building Challenge, for example, not only apply the program’s criteria the same way for every building they evaluate, but also apply them the same way as each other.
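Inter-rater reliability of categorical judgments is commonly quantified with Cohen’s kappa, which discounts the agreement two raters would reach by chance. The following sketch is illustrative only; the pass/fail ratings are invented and are not drawn from any of the programs discussed here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters judged independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical certification decisions by two building evaluators.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.5
```

A kappa of 1.0 indicates perfect agreement; values near 0 indicate agreement no better than chance, which would undermine any claim of inter-rater reliability.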

Truth in Advertising: Validity

The final and most important dimension of information quality is validity, which refers to the extent to which a metric actually represents a phenomenon. In other words, how well does it measure what it claims to be measuring? Validity is a central concept used across both the social and (p.109) natural sciences, and different fields define it differently. For our purposes, we are primarily concerned with construct validity, which is a measure of the quality of a particular operationalization of a concept or behavior—the translation of something in the real world into a functional representation or metric, such as the certifications that Lynn is evaluating.22 Such construct validity can be substantiated by either evaluating the quality of the operationalization directly or comparing it to other criteria that are related to the construct in question.23 So Lynn might assess the validity of LEED by analyzing its own operationalization of building greenness and assessing the relevance, comprehensiveness, and specificity of the criteria it uses. Alternatively, she might compare its outcomes to other criteria she believes are related to building greenness, such as energy efficiency.

The difference between validity and reliability is subtle, but important. Lynn may find that a particular green building certification consistently applies its criteria, but those criteria may not necessarily be accurate assessments of building greenness. Likewise, she may feel that another certification has better criteria, but that they are not applied reliably over time or by different evaluators. This distinction relates to the difference between accuracy, a synonym of validity, and precision, a synonym of reliability. Measurement accuracy refers to the closeness of fit between a metric’s estimation of a phenomenon and the phenomenon itself, while measurement precision refers to the level of random error associated with an estimate. Suffice to say, a high-quality metric produces both accurate and precise estimates. Lynn is therefore looking for a certification that is based on a valid and accurate measure of building sustainability, is produced reliably and with a high degree of precision, and can be replicated due to its high level of transparency.
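The accuracy/precision distinction can be made concrete with a small simulation: one hypothetical rater is biased but consistent (precise yet inaccurate), while another is unbiased but noisy (accurate on average yet imprecise). All numbers below are invented for illustration.

```python
import random
import statistics

random.seed(7)
TRUE_SCORE = 80.0  # hypothetical "true" greenness of a building

# Rater 1: precise but inaccurate (small spread, systematic +10 bias).
biased = [TRUE_SCORE + 10 + random.gauss(0, 1) for _ in range(1000)]
# Rater 2: accurate but imprecise (no bias, large spread).
noisy = [TRUE_SCORE + random.gauss(0, 10) for _ in range(1000)]

for name, scores in [("precise/inaccurate", biased), ("accurate/imprecise", noisy)]:
    bias = statistics.mean(scores) - TRUE_SCORE   # systematic error (validity)
    spread = statistics.stdev(scores)             # random error (reliability)
    print(f"{name}: bias={bias:+.1f}, spread={spread:.1f}")
```

The first rater’s estimates cluster tightly around the wrong value; the second’s scatter widely around the right one. A high-quality metric minimizes both kinds of error.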

The Quality of the Information Generated by Information-Based Environmental Governance Strategies

The three characteristics discussed previously—validity, reliability, and replicability—are the core foundations of what we can call information quality. High-quality information is generated by methods that are valid (they measure what they say they are measuring), reliable (they consistently produce similar results), and replicable (they can be reproduced). Expanding beyond the specific context of building certification, this section examines (p.110) the extent to which information-based environmental governance strategies are indeed generating such high-quality information. While some definitions of information quality also include dimensions relating to its relevance, comprehensiveness, and accessibility, these are the focus of other chapters (chapters 2 and 5). This section first presents data related to the replicability of these initiatives, and then summarizes the results of an analysis of their reliability and validity.

The Key to Replicability: Transparency

As mentioned in chapter 3, transparency is not only an important dimension of information quality, but it is also a key aspect of trustworthiness and legitimacy. It is one of the most frequently mentioned subjects in the peer-reviewed literature on eco-labels and ratings,24 and encompasses the more specific concepts of traceability and auditability.25 University of Michigan professor of business law and ethics David Hess, for example, points out that “to have meaningful stakeholder engagement requires that we first have a robust information-based transparency policy with comparable data.”26 Likewise, Harvard scholar Archon Fung asserts that democratic transparency requires the disclosure of rich, usable, and actionable information whose availability is proportionate to the risks to which it is relevant.27 Graeme Auld of Carleton University and Lars Gulbrandsen of the Fridtjof Nansen Institute differentiate between procedural transparency, which refers to the “openness of governance processes” and relates to the concept of input legitimacy, and outcome transparency, which “deals with the substantive ends of a given policy intervention” and can contribute to output legitimacy.28

The process of developing my codes for transparency revealed that it is a more complex and multidimensional concept than is usually acknowledged. In developing the coding system for the EEPAC Dataset, I identified four primary ways that initiatives can be transparent about their evaluation processes, which will be described in more detail later in this section. The coding process also distinguishes between limited and strong statements of transparency. Strong criteria transparency, for example, indicates that all of the criteria are fully explained, with at least a sentence about what is being measured and what data is being used for each criterion cited, while limited criteria transparency indicates that some but not necessarily all the criteria (p.111) are listed, and they may or may not be described in any detail. Discussion of the coding results follows.

Figures 4.2 and 4.3 summarize the data for each dimension of transparency. Criteria transparency refers to the extent to which a case describes the criteria it uses in its evaluation of either products or companies. Sixty-seven of the cases, or 27 percent, describe some but not all of the initiatives’ criteria (limited criteria transparency), while 51 percent describe their criteria in full detail (strong criteria transparency). An example of a case that describes its criteria in full detail is EPA’s Design for the Environment Standard for Safer Products, which documents both the product and component-level requirements for certification. An example of a case that describes some but not all of its criteria is Fortune’s description of companies on its Green Giants list as having “gone beyond what the law requires to operate in an environmentally responsible way.”

Data transparency refers to whether an initiative provides the actual data underlying the evaluation on its website. The content analysis of the full sample indicates that over 40 percent of the cases provide none of their underlying data, 35 percent provide some but not all of their data (limited data transparency), and 24 percent provide all of their underlying data (strong data transparency). Source transparency refers to whether a case provides a list of the sources of the data that is the basis of its evaluation. Approximately one-fifth of the cases have limited source transparency (some but not all of the sources are listed), and another fifth have strong source transparency (all of the data sources are listed). Three-fifths do not provide any information about their data sources.

Figure 4.2 Types and levels of transparency (by percent of cases). Note: Error bars indicate 95 percent confidence intervals for each “limited” and “strong” sample proportion.

Figure 4.3 Levels of method transparency. Notes: Error bars indicate 95 percent confidence intervals for each sample proportion. Adapted from Bullock, “Signaling the Credibility of Private Actors as Public Agents,” 203.
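The figure notes report 95 percent confidence intervals for sample proportions. The book does not specify which interval formula is used, so the sketch below assumes the standard normal (Wald) approximation, applied to the limited-criteria-transparency count reported in the text (67 of 245 cases).

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a sample proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# 67 of the 245 cases exhibited limited criteria transparency (~27 percent).
p, lo, hi = proportion_ci(67, 245)
print(f"{p:.1%} (95% CI: {lo:.1%} to {hi:.1%})")  # → 27.3% (95% CI: 21.8% to 32.9%)
```

For proportions near 0 or 1, a Wilson score interval is generally preferred over this simple approximation, but with n = 245 and moderate proportions the two agree closely.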

Method transparency refers to the level of detail provided about how the evaluation was conducted. Given the complexity of this characteristic, four binary codes indicating increasing levels of method transparency were used to document this characteristic (see figure 4.3). These codes capture two dimensions of method transparency—the specificity of the information (detailed vs. general) provided about the methods used and the completeness of that information (complete vs. incomplete). Approximately one-third of the programs provide a detailed and complete description of all the methods, algorithms, and processes necessary to replicate the results of their assessment, 26 percent provide most but not all of the information necessary to (p.113) replicate their results (detailed but limited description), 13 percent provide a complete but general description of their evaluation process, and 7 percent provide a limited and general description of their methods. The remaining 22 percent provide no information on their methods at all. Cases implemented by firms were significantly less likely than cases implemented by evaluation organizations to provide detailed and complete methodological descriptions and more likely to provide limited and general descriptions.29 No significant difference was found between cases implemented by firms and cases implemented by evaluation organizations for the other forms of transparency.

An example of a case that provides the most limited amount of methodological information is Sierra Club’s Pick Your Poison Guide to Gasoline, which states that its editorial interns “lump [oil companies] into three general categories, the ‘bottom of the barrel’ (ExxonMobil and ConocoPhillips), the ‘middle of the barrel’ (Royal Dutch Shell, Chevron, Valero Energy Corporation, and Citgo), and the ‘top of the barrel’ (BP and Sunoco).” The Green Loop is an example of a case that provides limited and general information, as it outlines a three-step evaluation process for screening products for “sustainability and aesthetics.” An example of a case that provides most of the information necessary to replicate its results (detailed but limited information) is the Greener One, which outlines the specific criteria used to calculate its Green Index, but does not explain how the scores are calculated. An example of a case that provides a detailed and complete description of its methods is the University of Massachusetts Toxics 100 Air Polluters Index, which explains in detail where its data comes from and how it compiles that data into its own score.

The Many Faces of Reliability: Missing in Action

Cases that are transparent and replicable meet a basic requirement of information quality, but they must also be reliable. If the designers of information-based initiatives are serious about confirming that their information is indeed reliable, they should cite their own studies of their metrics’ reliability or cite those done by academics or other third parties. And their websites should discuss how they have determined that their metrics are being consistently applied over time and across evaluators.

Following this logic, I used the software program MaxQDA to conduct a lexical search of the text from the websites of the 245 cases in my dataset (p.114) for any use of the terms “reliability,” “reliable,” and “consistency.” I found sixty-three references to consistency across seventeen initiatives and 130 references to either “reliable” or “reliability” across fifty-six initiatives that relate to the quality of the information they provide.30 This analysis revealed several important insights. First of all, as figure 4.4 shows, 76 percent of the cases do not make any reference to either the consistency or reliability of their data. Approximately 2 percent only mention reliability or consistency as general ideals to work toward, and another 9 percent make general claims about the reliability or consistency of their data that are not clear about what they mean by these terms. Another 3 percent make limited claims about the reliability or consistency of their data sources, but not their own information.
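A lexical search of this kind can be approximated outside MaxQDA with a simple regular-expression scan. The sketch below uses invented case texts as stand-ins for the actual EEPAC website corpus.

```python
import re

# Hypothetical website text for a handful of cases (stand-ins for the
# 245-case EEPAC corpus that was actually searched with MaxQDA).
cases = {
    "CaseA": "Our data are reliable and collected with consistency in mind.",
    "CaseB": "We certify products for energy efficiency.",
    "CaseC": "Reliability is assessed by independent auditors for reliability.",
}

terms = ["reliability", "reliable", "consistency"]
# Word boundaries (\b) prevent matches inside longer words.
pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)

total_refs = 0
cases_with_refs = []
for name, text in cases.items():
    hits = pattern.findall(text)
    if hits:
        cases_with_refs.append(name)
        total_refs += len(hits)

print(total_refs, cases_with_refs)  # → 4 ['CaseA', 'CaseC']
```

As in the book’s analysis, a raw keyword count is only a first pass; each hit still has to be read in context to determine whether the term actually refers to the quality of the initiative’s own information.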

Interestingly, three cases—Climate Counts, The Power Scorecard, and EcoTrotters—have legal disclaimers that explicitly state that they do not guarantee that their information is reliable. Climate Counts, for example, states that it “makes no representations about the accuracy, reliability, completeness, or timeliness of the materials or of any statements or (p.115) other information displayed or distributed through the site.” And a fourth initiative, Covalence’s EthicalQuote, explicitly states that “it does not see some sources as more reliable than others” and considers all sources equally because it “does not validate information sources” or the content of their information.

Figure 4.4 Levels of reliability/consistency. Note: Error bars indicate 95 percent confidence intervals for each sample proportion.

The remaining twenty-two cases (9 percent of the total) do make some specific claims about the reliability of their data. Six refer to reliability in the sense of trustworthiness, expertise, and independence, while a seventh uses it more in the sense of accuracy and validity. Beyond Grey Pinstripes explicitly states that its method of blind scoring by pairs of coders was “done to obtain inter-rater reliability,” while three other cases implicitly describe their efforts to increase this type of reliability. The HERS Index, for example, describes its requirement for raters to undergo ongoing training “so that customers and the public can be assured of receiving competent and reliable services.” SCS periodically splits its samples for its NutriClean Pesticide Free certification and sends them to multiple laboratories to test their consistency. While no case explicitly describes its test-retest reliability, three cases describe the consistency of their data across tests. The Carbon Disclosure Leadership Index, for example, states that one of its underlying principles is for its underlying data collection processes to “use consistent methodologies to allow for meaningful comparisons of emissions over time.”

Three cases—PEFC, Responsible Care, and UTZ Certified—describe the reliability of their certifications in the sense of harmonization across the different programs, countries, and companies that are implementing them. While similar to inter-rater reliability, this idea of harmonization suggests a broader consistency across not only raters but also the designers of rating systems themselves. Four other cases emphasize the importance of comparability in their discussion of reliability. For example, the National Fenestration Rating Council emphasizes that its certification enables builders and consumers to “reliably compare one product with another,” while the Air-Conditioning, Heating, and Refrigeration Institute declares that its certification of HVAC equipment “provides consumers with a reliable apples-to-apples comparison of equipment they are considering purchasing.” Two other cases mention specific mechanisms by which they ensure their information is reliable. GoodGuide states that it uses “quality assurance and quality control (QA/QC) processes” to (p.116) ensure the reliability of its data, while Citizens Market relies on its user-driven review process, which enables users to rate the quality of the data it uses to evaluate companies.31

The Varieties of Validity: From Generalities to Specifics

Clearly, most cases in my dataset have not demonstrated their ability to reproduce their results across time and across raters, and none of them have fully met the standards of reliability outlined in this chapter. They also have limited replicability and transparency. But perhaps the programs are more effective at documenting their validity, the third dimension of information quality. This dimension is focused on the extent to which a metric is measuring what it is claiming to measure. Specifically, do these initiatives document the construct validity and measurement accuracy of their approaches to evaluating the sustainability of products and companies?

In order to address this question, I conducted lexical searches similar to the reliability analysis described earlier. I first searched the website text of the 245 cases for uses of the terms “valid,” “validity,” “validate,” and “validation.” In reviewing the 782 results generated by this search, I found that the ninety-nine initiatives using these terms were employing them in a wide variety of ways. Many were using them in the sense of trustworthiness and independence explored in chapter 3 (similar to some of the uses of reliability discussed previously), while others were using them to describe things not relevant to the quality of their metrics (e.g., this is a “valid objective of the company”). I therefore decided to narrow my search to only include uses of “validity,” based on the assumption that those initiatives that were serious about assessing their validity should explicitly reference the term. Given that, despite the important distinctions described in this chapter, many people treat validity and accuracy as close synonyms, I conducted a search of “accuracy” as well. This search is also relevant to assessing the extent to which the cases discuss their measurement accuracy and ability to detect and avoid errors and bias in their data collection processes.

The “validity” search yielded sixty-six results across twenty-six initiatives. Five of those cases only mention validity in a general sense. For example, the Cradle to Cradle certification states that “the certifying body will judge the validity and efficacy of each applicant’s [material reutilization] program on a case-by-case basis,” and does not explain what it means by (p.117) validity. In contrast, the Power Scorecard provides a more detailed description of its construct validity, describing why and how it has included emission reduction credits in its scoring of electricity providers. Two additional initiatives define their validity by comparing themselves to other related criteria. Bluesign, for example, asserts that it has strong validity because its standard is oriented “to the most stringent and most commercially significant regulations and laws around the world.”

Seven cases use validity in the sense of measurement accuracy. The Toxics Release Inventory, for example, describes the multistep process it employs to check for submission errors and “verify the validity” of the data companies have submitted. As a specific mechanism for ensuring their measurement accuracy, four cases discuss their validity in the context of audits they require. The Friend of the Sea certification, for example, states that the validity of its certificate is “dependent on the outcome of subsequent yearly surveillance activities.” Another aspect of measurement accuracy is whether the underlying data and the evaluation process have been updated recently. Four cases focus on their requirements for updates and recertification as indicators of their validity. GlobalG.A.P, for example, states that its certificates have “an initial validity of twelve months,” while The Gold Standard notes that the validity of its versions is also time-delimited, and after a one-month grace period companies must use the most recent and valid version.

Whether both the underlying data and the criteria used to assess those data are up to date is indeed an important aspect of measurement accuracy. For this reason, I also coded every reference to the generation and publication dates of both the data and criteria used by these initiatives, as well as any discussion of how they update and keep current these two critical components of their evaluations. Just over 13 percent state they have updated their criteria through explicit and systematic review processes, 10 percent have updated their criteria through ad hoc and limited review processes, and less than 1 percent have pending updates. Over 70 percent do not mention the age of their criteria, and over 75 percent do not mention any updating process for their criteria. Approximately 30 percent of the cases have updated their data through explicit and systematic processes, nearly 6 percent have updated their data through ad hoc and limited processes, and less than 1 percent claim their data is currently undergoing an updating (p.118) process. Nearly 70 percent do not mention the age of their data, and over 60 percent do not mention any data updating process.

My lexical search for “accuracy” resulted in 241 hits across fifty-seven cases. The term was only used in a general sense by thirty-one cases. An example of such a reference is the Corporate Responsibility Index’s claim that it reviews “all company submissions to ensure completeness, accuracy, and consistency.” Twenty-six other cases used it in more specific ways. One initiative, CARMA, uses accuracy in the sense of predictive validity, stating that it has verified that its statistical model predicts “actual emissions with high accuracy, using officially-reported emissions from thousands of power plants in the U.S., Canada, the European Union and India.” Another case, HealthyStuff.org, refers to accuracy in the sense of test-retest validity, stating that it takes repeat samples “in order to evaluate the variation per product [and] to assess and verify the accuracy of [its] testing.” Six cases refer to accuracy in the context of their peer review or auditing processes, while two others refer to it as dependent on the expertise of the evaluators. Rainforest Alliance, one of the latter, emphasizes that assessments “must involve individuals who are familiar with the particular region and type of forest” and “use region-specific standards.” Figure 4.5 summarizes the results of these two sets of searches, and shows that over 70 percent of the cases do not mention either the accuracy or validity of their claims.


Figure 4.5 Levels of validity/accuracy.

Note: Error bars indicate 95 percent confidence intervals for each sample proportion.
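
For readers who want the arithmetic behind error bars like these, a 95 percent confidence interval for a sample proportion can be computed with the standard normal approximation. This is a generic sketch; the counts below are illustrative stand-ins, not figures taken from the EEPAC Dataset.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% normal-approximation CI for a sample proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] so the interval stays a valid proportion.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Illustrative only: e.g., 66 of 245 cases mentioning a term.
p_hat, ci_low, ci_high = proportion_ci(66, 245)
```

With small counts or proportions near 0 or 1, a Wilson interval would be a better choice than this simple approximation.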

(p.119) It is important to note that just because an initiative makes a claim about its validity or accuracy does not guarantee that it is indeed valid or accurate. While detailed assessments of the validity of all 245 cases in my dataset are beyond the scope of this chapter, I did look at two additional and more specific factors that I believe are important indicators of information quality. The first is the use of life cycle analysis (LCA) concepts and techniques in evaluations of products and companies. As the EPA explains:

Life cycle assessment is a “cradle-to-grave” approach for assessing industrial systems. “Cradle-to-grave” begins with the gathering of raw materials from the earth to create the product and ends at the point when all materials are returned to the earth. LCA evaluates all stages of a product’s life from the perspective that they are interdependent, meaning that one operation leads to the next. LCA enables the estimation of the cumulative environmental impacts resulting from all stages in the product life cycle, often including impacts not considered in more traditional analyses (e.g., raw material extraction, material transportation, ultimate product disposal, etc.). By including the impacts throughout the product life cycle, LCA provides a comprehensive view of the environmental aspects of the product or process and a more accurate picture of the true environmental trade-offs in product and process selection. The term “life cycle” refers to the major activities in the course of the product’s life span from its manufacture, use, and maintenance, to its final disposal, including the raw material acquisition required to manufacture the product.32

The use of LCA helps increase the comprehensiveness of sustainability evaluations and avoid the sin of the hidden trade-off, one of the key themes of chapter 2. It is also a relatively rigorous process that clearly defines processes and their boundaries, quantifies both human health and environmental impacts, and tracks uncertainties. While it has its limitations, it is currently the most sophisticated and widely accepted method available for systematically assessing the relative performance of different products and companies, and any valid environmental certification or rating system should incorporate it into its design. At a minimum, such initiatives should include criteria relating to LCA, and ideally, they should themselves be designed around the concept of the life cycle and make use of the many techniques available from the field of industrial ecology to assess the impacts associated with each phase in that cycle.
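
The cradle-to-grave accounting the EPA describes reduces to a simple idea: sum an impact over every life-cycle stage, so that stages excluded from a narrower "gate-to-gate" analysis still count. The toy sketch below uses invented per-stage numbers for a hypothetical product, purely to illustrate the hidden trade-off.

```python
# Invented per-stage impacts (e.g., kg CO2e) for a hypothetical product.
stages = {
    "raw material extraction": 2.1,
    "manufacturing": 4.5,
    "transportation": 0.8,
    "use": 6.0,
    "disposal": 1.2,
}

def total_impact(stage_impacts):
    """Cradle-to-grave total: every stage counts."""
    return sum(stage_impacts.values())

def hidden_impact(stage_impacts, visible_stages):
    """Impact missed when only some stages are examined."""
    return total_impact(stage_impacts) - sum(
        v for k, v in stage_impacts.items() if k in visible_stages)

cradle_to_grave = total_impact(stages)
missed_by_factory_focus = hidden_impact(stages, {"manufacturing", "use"})
```

Here an evaluation that looked only at manufacturing and use would miss a substantial share of the hypothetical product's total footprint, which is precisely the sin of the hidden trade-off.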

For these reasons, I also conducted a lexical search of the EEPAC Dataset for “life cycle analysis” (or “assessment”) and “LCA.” I found 167 relevant LCA references across twenty cases. Most of these references come from BASF’s Eco-Efficiency Analysis Label, Green Format, SMaRT Certified, and (p.120) the level BIFMA Sustainable Furniture Standard, all of which directly integrate life cycle analysis principles into the design of their assessment frameworks. As BASF states, “Life Cycle Inventories and Life Cycle Assessments form the basis of every Eco-Efficiency Analysis.” Six other cases (CERES-ACCA Sustainability Reporting Awards, B Corporation, Pacific Sustainability Index, Climate Counts, U.S. Beverage Container Recycling Scorecard, the Sustainable Forestry Initiative, and Green Globes) include specific criteria in their assessments that relate to LCA, although it is not a primary feature of their design. For example, Green Globes Design of New Buildings certification gives points for “conducting a Life Cycle Assessment of the building assemblies and materials.” Two cases mention LCA as an optional way to meet one of their criteria, while another two cases mention it in the descriptions of the companies they have assessed, implying that LCA is a characteristic they are looking for in their assessments.

The second more specific aspect of validity that I would like to highlight is the weighting of criteria within assessment frameworks. Any certification or rating system that assesses more than one aspect of a product or company must make decisions about the relative importance of those aspects. These decisions involve both technical and values-based considerations, and initiatives should be transparent about how they make them. This is important not only because criteria weights are an important dimension of validity and can have a significant effect on evaluation outcomes, but also because they are necessary for replicating the results of these evaluations. Initiatives that fail to discuss or at least document their weightings have significant gaps in both validity and replicability. Following this logic, I therefore conducted a lexical search for all references to “weights” or “weighting(s),” and found eighty-eight references across thirteen initiatives. Two of these cases only provide a brief mention of weightings or list the criteria weights, while the rest provide at least some description of the logic behind the prioritization process. For example, the SkinDeep Cosmetics Database explains that higher weighting factors were assigned to “categories of health concern for which studies provide evidence for effects at low doses, for permanent effects stemming from exposures during development, [and] for toxicity endpoints that tend to impact multiple biological systems in the body or to impair reproduction.”
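
A toy example shows why undisclosed weights undermine both validity and replicability: the same two hypothetical products swap rank under two equally defensible weightings of the same criterion scores. All names and numbers here are invented.

```python
def weighted_score(scores, weights):
    """Weighted sum of criterion scores."""
    return sum(s * w for s, w in zip(scores, weights))

# Invented scores on two criteria: [energy, toxics].
product_x = [0.9, 0.2]
product_y = [0.3, 0.8]

energy_heavy = [0.8, 0.2]   # one defensible prioritization
toxics_heavy = [0.2, 0.8]   # another, equally defensible

x_energy = weighted_score(product_x, energy_heavy)
y_energy = weighted_score(product_y, energy_heavy)
x_toxics = weighted_score(product_x, toxics_heavy)
y_toxics = weighted_score(product_y, toxics_heavy)
```

Without knowing which weighting a program used, a reader cannot reproduce its ranking or judge whether the prioritization reflects their own values.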

An additional theme found throughout both the validity and accuracy references is the use of caveats and disclaimers about the information being (p.121) provided. One interesting example comes from the Leonardo Academy’s Cleaner and Greener certification. It highlights a trade-off between accuracy and costs, and in the case of incorporating imports and exports into its building emission footprint calculation, states that the “possible increased accuracy to the emission factors does not justify the additional workload necessary.” Eighteen other initiatives make more straightforward disclaimers (often on their Terms and Conditions pages) about the accuracy or validity of the information they are providing. For example, SCS, in its description of its Indoor Air Quality Certification Program, states that it “does not make any warranty (express or implied) or assume any liability or responsibility to the user, reader or other third party, for the accuracy, completeness or use of, or reliance on, any information contained within this program.”

The Information Quality Landscape

The data presented above on the replicability, reliability, and validity of the 245 cases in the EEPAC Dataset can be brought together to provide a snapshot of the information quality landscape associated with information-based environmental governance strategies. Figure 4.6 presents such a snapshot. The vertical axis measures the cases’ level of transparency, combining data from figures 4.2 and 4.3. If a case has at least limited levels of all four types of transparency (criteria, source, data, and method transparency) discussed earlier, it would receive a score of 4 and be in the bottom row of cases in figure 4.6. If it is not transparent on any of these dimensions, then it would receive a score of 0 and be in the top row. The horizontal axis, on the other hand, combines the data from figures 4.4 and 4.5 on the reliability and validity of the cases. The first, left-most column includes cases that have no references to either their reliability or validity, while the fourth, right-most column includes cases that discuss both their reliability and validity. The second column from the left includes cases that provide some mention of their validity but none of their reliability, while the third column includes cases that provide some discussion of their reliability but none of their validity.
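
The placement rule just described can be made explicit in a few lines. The sketch below computes a case's transparency score and its reliability/validity column; the case record shown is invented, not an entry from the dataset.

```python
def landscape_cell(case):
    """Return (transparency score 0-4, reliability/validity column 1-4)."""
    transparency = sum(case[k] for k in
                       ("criteria", "source", "data", "method"))
    if not case["validity"] and not case["reliability"]:
        column = 1                     # mentions neither
    elif case["validity"] and not case["reliability"]:
        column = 2                     # validity only
    elif case["reliability"] and not case["validity"]:
        column = 3                     # reliability only
    else:
        column = 4                     # discusses both
    return transparency, column

# An invented case record, not an entry from the EEPAC Dataset.
example = {"criteria": True, "source": True, "data": False,
           "method": False, "validity": True, "reliability": False}
cell = landscape_cell(example)
```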

Figure 4.6 shows that the extent of transparency and discussions of information quality vary widely across this landscape. The seven cases in the upper left corner—the World’s Most Sustainable and Ethical Oil Companies, the Big Green Purse, EcoChoices, EcoMall, Green Culture, Green (p.122)


Figure 4.6 The information quality landscape.

Note: Each cell represents the number of cases that fall into the specified category. Thus, the seven cases in the upper left corner neither discuss their validity or reliability nor are transparent about any of the four methodological factors coded for—their criteria, sources, data, or methods. Likewise, the five cases found in the lower right corner discuss some aspect of both their reliability and validity and have at least limited transparency across all four of these methodological factors. The “level of transparency” axis is a count of these four types of transparency found in each case.

Options, and EnergyGuide—neither discuss their validity or reliability nor are transparent about any of the methodological factors that I coded for. On the other hand, the five cases found in the lower right corner—the GreenSpec Directory, the College Sustainability Report Card, the Sustainable Forestry Initiative, GoodGuide, and Seafood Watch—do describe their validity and reliability and are at least somewhat transparent about their criteria, methods, sources, and data.

Overall, figure 4.6 shows that the majority of initiatives (nearly 70 percent) do not directly discuss either their reliability or validity, and have only limited transparency into their methods, criteria, data, and sources. While these concepts are inherently difficult to measure and this graphic is a simplification of the data presented earlier in the chapter, it nevertheless (p.123) reveals that these programs generally lack a sophisticated approach to information quality assurance. Even the cases that do mention their validity or reliability usually do so in limited and unspecific ways.

The Information Realism Perspective

For commentators who are skeptical of information-based governance strategies, this landscape of information quality will likely reinforce their concerns expressed in earlier chapters. Not only are the majority of these initiatives overly narrow, lacking in independence, unsupported by appropriate expertise, and opaque about their accountability relationships, they also lack fundamental quality assurance and control mechanisms. They generally are not replicable—60 percent of the cases do not reveal the sources of their information and 75 percent do not provide any information whatsoever about their assessment methods. If we do not know where the initiatives are getting their data from or how they are conducting their analyses, why should we trust their results? Likewise, 78 percent do not discuss their reliability, and 83 percent do not discuss their validity. Over three quarters of these cases therefore are not providing any assurance that their metrics of sustainability performance are measuring what they claim to be measuring and that they generate similar results each time they are used to measure something.

More specifically, the vast majority of these cases do not mention the age of their data and criteria or any process for updating them. Only 8 percent incorporate principles or methods based on the most sophisticated method for evaluating the environmental impacts of products and companies—life cycle analysis—into their criteria. Only 5 percent describe how they weight the different criteria in their evaluations. Information pessimists might reasonably suggest that such lack of methodological rigor is summed up by eighteen initiatives’ telling statements disclaiming any responsibility for the accuracy, reliability, or validity of the information they provide. If these initiatives will not stand behind their results, why should the rest of us pay any attention to them? How do we know what they are peddling is not misinformation at best and disinformation at worst?

Observers who are more optimistic about the potential of information-based governance strategies will likely respond very differently to the data presented in this chapter. They might view the transparency data as (p.124) generally encouraging, given that 78 percent of the cases provide at least limited information about their criteria, and 59 percent make at least some of their underlying data available to the public. The fact that one-third of the initiatives provide detailed and complete descriptions of their methods demonstrates that a large proportion of these ratings and certifications are likely to be replicable.

These information optimists might reasonably question whether lexical searches are themselves valid metrics of validity and reliability, given that they might miss cases that do not use the particular keywords searched for. They might dismiss the disclaimers about accuracy as legalese demanded by lawyers wary of costly lawsuits, not as a reflection of the designers’ confidence in their own evaluations. They might point to the limitations of life cycle analysis as a method of assessment—LCAs are expensive, require data that are not always available, and are not appropriate for every issue and topic of analysis. They might complain that the focus on different forms of reliability and validity reflects a disciplinary bias toward quantitative and statistical analyses, and privileges organizations with the resources and backgrounds needed to conduct those analyses.

For the information realist, there is much to both agree with and contest on both sides of this debate. The information optimist is right that the data presented in this chapter have limits: validity, reliability, and replicability are inherently complex and multidimensional concepts that are indeed difficult to measure, and any measure of them will likely have a large degree of error. However, it is important to realize that such uncertainty means that the estimates provided in this chapter may be either underestimating or overestimating how valid, reliable, and replicable these cases are. On the one hand, more cases may be transparent than suggested in this analysis, but on the other hand, those cases coded as being transparent may not actually be that replicable. Likewise, I did not assess the quality of the reliability or validity claims that the initiatives make on their websites, only whether they make them at all. So more cases may discuss these aspects of their methods without explicitly using the search terms, but alternatively, fewer cases may demonstrate high levels of actual reliability or validity. Given that there is no clear reason to believe that the results are skewed in one way or another, it is reasonable to assume that they are relatively accurate estimates of these characteristics.

(p.125) As with the subjects of the earlier chapters, these estimates demonstrate to the information realist both the challenges and opportunities facing information-based governance strategies. On the one hand, their generally low levels of information quality and methodological sophistication raise important concerns about their usefulness to the public. On the other hand, some initiatives have demonstrated that it is possible for them to at least discuss the quality of the information they are providing with some level of sophistication. They show that it is realistic for these programs to engage questions relating to their replicability, reliability, and validity, and begin making use of more sophisticated forms of analysis and quality control. Regarding the complaint that such an emphasis reflects a quantitative or statistical bias, the quantitative bias is already embedded in many of these rating and certification programs, as they often involve numerical data in their assessments. Until critics or designers propose some other means to assure the public of the methodological rigor of these programs, it makes sense to use the concepts and standards of quality that are most commonly utilized to assess this kind of data.

As for the contention that data may not be available for certain types of life cycle analyses and that such analyses may not always be appropriate, this may indeed be the case. But given that LCA is the most accepted method for assessing sustainability performance, the onus is on evaluation organizations to explain why they are not using it. Just as engineers should have to justify why they are not using the most accurate and valid methods for building a bridge, so must the designers of these programs justify why they are not using the most accurate and valid methods for evaluating products and companies’ environmental impacts. Once they do so, then their audiences can determine if these are valid reasons for using an alternative evaluation method.

The argument that LCA (and other methods that increase information quality) is too expensive raises the important trade-off of cost and accuracy, which was also mentioned by the Cleaner and Greener certification. While no guarantee, the availability of both time and money can enable the use of more replicable, reliable, and valid evaluation methods. As the old adage says, “you get what you pay for.” This places some of the responsibility on the public and particular stakeholder groups to demand and be willing to invest in—either as policymakers, consumers, executives, philanthropists, or taxpayers—high-quality information. Just as we demand high-quality (p.126) bridges, we should demand high-quality sustainability information. But the evaluation organizations also need to offer and market such information to these groups, which this chapter clearly shows that the vast majority have not done.

Nevertheless, this result does not necessarily warrant giving up entirely on these initiatives, as the information pessimists might suggest. The cases that do discuss and defend the quality of their information and make the effort to update their data and criteria, document their weightings, and incorporate LCA into their processes can and should be recognized and rewarded for doing so. They provide guideposts for other initiatives that are committed to increasing the replicability, reliability, and validity of their evaluations, and a stark contrast to those that refuse to address their methodological deficiencies. The next section highlights further sources of inspiration and promising practices from both the academic literature and other sectors of society that have struggled with these same challenges.

In critiquing the quality of the information being produced by these information-based governance strategies, it is important to remember that these issues plague traditional governance approaches as well. Replicable, reliable, and valid measures of the effectiveness of environmental regulations are also few and far between, for example. Even the academic community, which we would assume should be the paragon of methodological virtue, has struggled with ensuring the quality of the data and conclusions that it produces. The information realist acknowledges these limitations of existing data across all of these domains, but rather than giving up on them recognizes that some of these data are nevertheless better than others. If we believe their underlying enterprise to be important and have value (conclusions that are discussed further in both chapter 2 and chapter 6), then the key is to incentivize them to improve by rewarding those that are using the best available methodological practices—and penalizing those that are not.

Promising and Problematic Practices

This chapter has suggested a number of important practices that evaluation organizations can follow to demonstrate to their audiences that they are providing high-quality information. These are also practices that people like Lynn can look for when they are evaluating competing (p.127) certifications and ratings. The absence of such practices is itself problematic, and should raise concerns for Lynn as she tries to identify the most valid, reliable, and replicable program for her institution. This section summarizes those practices and provides additional examples from both the academic literature and other domains.

Before diving into the specific areas of replicability, reliability, and validity, I have three general recommendations for certification and rating designers. The first relates to standards. Just as these organizations have been encouraging companies to report their corporate social responsibility performance in standardized formats and topics (through common sets of topics and questions such as the Global Reporting Initiative framework), they too should be reporting their own performance in a similar manner. Key components of that performance are measures of their metrics’ replicability, reliability, and validity, which are fundamental to any research report. As William Trochim explains, such reports should “include a brief description of your constructs and all measures that will be used to operationalize them. … For all [measures], you should briefly state how you will determine reliability and validity. … For reliability, you must describe the methods you used and report results. A brief discussion of how you have addressed construct validity is essential.” Specific measures of information quality, such as intra-class correlation and Kappa coefficients, should be reported whenever possible.
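
As one example of the kind of statistic Trochim recommends reporting, Cohen's kappa measures agreement between two raters after correcting for chance agreement. The implementation and the reviewer decisions below are a generic sketch with invented data, not drawn from any of the initiatives discussed.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters of the same items."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected agreement if each rater assigned labels independently.
    expected = sum(c1[label] * c2[label]
                   for label in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented certification decisions from two hypothetical reviewers.
reviewer_a = ["pass", "pass", "fail", "pass"]
reviewer_b = ["pass", "fail", "fail", "pass"]
kappa = cohens_kappa(reviewer_a, reviewer_b)
```

Here the raters agree on three of four items (75 percent), but kappa is only 0.5 once chance agreement is removed, which is exactly why raw agreement rates are a poor substitute for the coefficients Trochim describes.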

The second general suggestion relates to language and the use of the terms discussed in this chapter. Given that they are used in different ways by researchers and the public, it is important to clearly signal when you are discussing the reliability and validity of your data in statistical terms, and not just in the general sense of being trustworthy and credible. “Reliability,” for example, is often used in two distinct ways—one is in the statistical sense of consistency described earlier and one in the more colloquial sense of trustworthiness and credibility described in chapter 3.33 The general use of these terms is technically correct and can be found on many initiative websites as well as in the academic literature. For example, in their analysis of the “reliability” of product eco-labels as instruments of ensuring agricultural biodiversity, VU Utrecht University researcher Mariëtte Van Amstel and her colleagues focus on the capacity and independence of the evaluation organization and not specifically on the consistency of their results.34 Ideally, however, we should reserve these terms for their use in the (p.128) statistical sense, and use other words—like “trustworthy” and “credible”—for other uses. While such a distinction may seem tedious, it is a valuable signal to stakeholders and the public that a program has thought systematically about the quality of its information, which can serve to differentiate it from its competitors.

A third general recommendation relates to who is conducting all of these analyses. As discussed in chapter 3, expertise is a key signal of credibility, and it is reasonable to expect that someone who has experience and training in statistical analyses be the person behind the numbers being used in these sustainability evaluations. Evaluation organizations should therefore report not only how they are assuring their information’s quality but also who is doing that quality assurance. Given the complexity of these analyses, ideally such quality assurance is a multistep process involving multiple experts, some who are actually crunching the numbers and others who are reviewing them for errors. Such a process might involve both inside and outside reviewers; regardless, it should be clearly articulated to the public.

Replicability Practices

Specifically regarding the replicability of their results, environmental certifications and ratings should clearly disclose their data, data sources, evaluation criteria, and the methods by which they use these criteria to analyze their data and generate their information. Michael Sadowski and his colleagues at the consulting firm SustainAbility assert that this process of opening up their methodological “black boxes” builds trust and can increase (rather than undermine) the acceptance and use of sustainability ratings.35 An excellent example of such transparency comes from the world of academic publishing, which is increasingly requiring researchers to publicly disclose their data and methods when their articles are published. Nature journals, for example, require that their authors “make materials, data, code, and associated protocols promptly available to readers without undue qualifications,” preferably through public repositories.36 The American Journal of Political Science has a similar requirement, but has also contracted with the University of North Carolina’s Odum Institute for Research in Social Science to verify that the submitted replication materials do indeed produce the reported results.37

(p.129) These developments have aptly been described as the “new scientific revolution,”38 and replication efforts are uncovering a large number of false positives in the literature.39 But as David Broockman and Joshua Kalla of the UC Berkeley Political Science Department argue, the discovery of these errors “does not suggest scientists are especially prone to making mistakes. Rather, it shows that scientific errors are increasingly likely to be detected and corrected instead of being swept under the rug.”40 Similar protocols can and should be established in the realm of information-based environmental governance strategies. Initiatives could use a standard checklist or form to disclose how they have verified the replicability, reliability, and validity of their claims. Policymakers or other stakeholders could develop information quality indices that take into account the different forms of transparency, reliability, and validity discussed in this chapter. Ideally, such indices would also incorporate the outcomes of independent attempts to replicate these initiatives’ results. These “metrics of metrics” would assist the public in evaluating the quality of the information these programs provide.

Reliability Practices

Reliability is an equally important attribute of environmental ratings and certifications. These programs should follow the same principles recommended by the Global Reporting Initiative (GRI) for corporate sustainability reports, which include comparability, accuracy, timeliness, clarity, and reliability. Specifically regarding reliability, the GRI specifies that organizations “should gather, record, compile, analyze and disclose information and processes used in the preparation of a report in a way that they can be subject to examination and that establishes the quality and materiality of the information.”41 Their methodological disclosures should have both general summaries for the public as well as more technical information for experts.

Organizations should also show how they have confirmed that their information is indeed reliable. Eugene Szwajkowski from the University of Illinois at Chicago and Raymond Figlewicz from the University of Michigan-Dearborn provide a good example of a check of test-retest reliability in their study using the SOCRATES database developed by the socially responsible investing firm KLD Research & Analytics (KLD), and find that it does have an acceptable level of this form of reliability.42 Judith Walls, a professor at Nanyang Business School, led a study that provides a similar check of the (p.130) inter-rater reliability of a new metric of environmental strategy by having a second researcher code a random selection of reports and calculating the related Cronbach’s alpha.43 The goal of these tests of reliability is to quantify the level of precision—and conversely, the uncertainty—of the information that these initiatives are providing.
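
The kind of inter-rater check Walls's study performs can be illustrated by treating each rater as an "item" and computing Cronbach's alpha over their scores. This is a minimal generic sketch with invented scores, not the study's actual procedure or data.

```python
def cronbachs_alpha(ratings):
    """ratings: one list of scores per rater, over the same cases."""
    k = len(ratings)                       # raters treated as "items"

    def variance(xs):                      # population variance
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    item_variance = sum(variance(r) for r in ratings)
    totals = [sum(scores) for scores in zip(*ratings)]
    return (k / (k - 1)) * (1 - item_variance / variance(totals))

# Invented scores from two hypothetical coders over four reports.
alpha = cronbachs_alpha([[3, 4, 5, 4],
                         [2, 4, 5, 5]])
```

An alpha of roughly 0.86 on this toy data would typically be read as acceptable consistency; the point is that the statistic, however computed, gives readers a concrete number rather than a bare assurance of reliability.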

Along these lines, designers should also keep track of the sources of uncertainty in their analyses, and communicate their estimates of uncertainty and levels of confidence in their final evaluations. An excellent example of such a practice is the National Research Council’s assessment of research doctoral programs in the United States. Acknowledging the high levels of uncertainty in such an endeavor, the assessment includes two parallel rankings, one based on faculty members’ stated preferences for different characteristics of high-quality programs and one based on faculty members’ revealed preferences for such characteristics by their ranking of a sample of programs. Both rankings are presented as ranges of values—as opposed to specific ranks—that represent the middle 90 percent of a large number of ratings, and thus take into account the variability in the underlying data.44 Thus instead of being ranked #3 or #11 in the country, Princeton’s Economics Department is estimated to rank between #4 and #11 or between #3 and #5, depending on which metric is used.45 Underlying this approach is an appreciation that the weights placed on different criteria can create major changes in rating results. In their analysis of three different methods to rank NFL teams, a team of researchers led by Tim Chartier, a professor of mathematics at Davidson College, demonstrated that this effect can occur even in rankings based on well-established linear algebra models.46 Thus it is critical to assess the sensitivity and uncertainty of rating systems based on weightings that are often much less systematically derived.
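The NRC’s methodology is far more elaborate, but the basic idea of reporting rank ranges rather than point ranks can be sketched by repeatedly re-ranking under randomly drawn criterion weights. The programs, scores, and weighting scheme below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical programs scored on three criteria (rows = programs)
scores = np.array([[0.9, 0.6, 0.8],
                   [0.7, 0.9, 0.6],
                   [0.8, 0.7, 0.9],
                   [0.5, 0.8, 0.7]])

rank_draws = []
for _ in range(5000):
    w = rng.dirichlet([1, 1, 1])                 # random draw of criterion weights
    totals = scores @ w
    ranks = totals.argsort()[::-1].argsort() + 1  # rank 1 = highest weighted total
    rank_draws.append(ranks)

rank_draws = np.array(rank_draws)
lo = np.percentile(rank_draws, 5, axis=0)   # the middle 90% of simulated ranks
hi = np.percentile(rank_draws, 95, axis=0)
for i, (a, b) in enumerate(zip(lo, hi)):
    print(f"program {i}: rank range #{int(a)}-#{int(b)}")
```

A program whose range is narrow is robust to the choice of weights; a wide range signals that its apparent rank is largely an artifact of one weighting scheme.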

As the research in this chapter shows, none of the cases employ all of these promising practices, and thus all of their data has limited reliability. Nevertheless, it is important to reiterate that this is a problem that plagues many other domains as well. As AccountAbility staff members Nicole Dando and Tracey Swift explain, financial auditors, even after decades of experience, still remain “unable to guarantee the robustness and reliability of financial accounting and reporting and to impart public confidence.”47 Social, ethical, and environmental accounting is in its infancy, and it is not surprising that it is facing similar challenges.

(p.131) Validity Practices

These challenges extend to the area of initiatives and organizations ensuring the validity of their data. However, numerous opportunities exist for them to improve their performance in this area as well. One way for these initiatives to increase their validity is to not make unqualified general claims or exaggerate what they are measuring. As the Federal Trade Commission (FTC) states in its Green Guides, “marketers should not make unqualified general environmental benefit claims … [because they] are difficult to interpret and … likely convey that the product, package, or service has specific and far-reaching environmental benefits” that marketers will not be able to substantiate. Such unqualified general claims are easy for consumers, competitors, and the FTC to identify as lacking validity. Claims that are based on methods and criteria that are out of date can also be quickly dismissed as having limited validity.

It is also important for evaluation organizations to be as comprehensive in their assessments as possible. Following the logic of the values discussion in chapter 2 and the principles of life cycle analysis discussed in this chapter, ratings and labels that are unduly narrow risk obscuring trade-offs among performance criteria. In a study of higher-education sustainability ratings, Nick Wilder and I show that these initiatives have significant gaps in coverage of key environmental and social issues.48 Nevertheless, two cases—STARS and the Pacific Sustainability Index—cover a broader range of criteria than the others and have greater validity as a result. An example from academia further demonstrates the problem of narrowly defined rating systems. Joel Baum, a management professor at the University of Toronto, shows that the design of the Impact Factor rating of journal article citations enables both articles and journals to “free-ride” on a few highly cited articles, and concludes that the metric has “little credibility” as a measure of publication quality, despite its widespread use. The use of food miles as a metric of food sustainability is another example—a report for the UK’s Department of Environment, Food, and Rural Affairs concluded that “a single indicator based on total food kilometres is an inadequate indicator of sustainability [because] the impacts of food transport are complex, and involve many trade-offs between different factors.”49 In order to have high validity, product and company sustainability assessments must take a holistic approach to evaluate all stages in the life cycle, from raw materials to end of life. And following the preceding (p.132) discussion, those stages and their associated criteria need to be weighted appropriately.

The validity of these metrics not only depends on their construction but also on how they are measured. First of all, they must be based on relatively up-to-date data sources. Second, they should avoid possible threshold effects in the data collection process. Lori Bennear at Duke University shows that the exemptions from reporting requirements for companies that use a chemical below a certain threshold level can account for up to 40 percent of observed declines in toxic release emissions. Furthermore, rankings of the highest- and lowest-emitting facilities can change significantly due to this threshold effect.50 Third, they should consider the accuracy of user-generated data, which is a key element of many of the cases in my dataset. Beyond questions about whether these users have the relevant expertise to evaluate the sustainability of products and companies, Tim Chartier and his colleagues reveal another limitation of ratings based on such data.51 Ratings based on a small number of user contributions can be misleading and unstable, as each user can have a large impact on the average score. Programs that make use of such user-generated data should explore linear algebra approaches (such as the Colley method) that compensate for this effect.52
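The Colley method referenced above can be sketched for this setting. The pairwise comparisons below are hypothetical; the point is that the method anchors every item to a baseline rating of 1/2, so an item with only a handful of user contributions cannot swing to an extreme score:

```python
import numpy as np

def colley_ratings(games, n):
    """Colley-matrix ratings from a list of (winner, loser) comparison pairs.

    Solves C r = b, where C_ii = 2 + comparisons involving item i,
    C_ij = -(comparisons between i and j), b_i = 1 + (wins_i - losses_i)/2.
    With no data, every rating is 1/2, so sparsely rated items stay near
    the middle rather than dominating the ranking.
    """
    C = 2 * np.eye(n)
    b = np.ones(n)
    for w, l in games:
        C[w, w] += 1
        C[l, l] += 1
        C[w, l] -= 1
        C[l, w] -= 1
        b[w] += 0.5
        b[l] -= 0.5
    return np.linalg.solve(C, b)

# Hypothetical head-to-head preferences among three products
games = [(0, 1), (0, 2), (1, 2), (0, 1)]
print(np.round(colley_ratings(games, 3), 3))  # approximately [0.743, 0.457, 0.300]
```

Compare this with a raw average: a product "rated" once by a single enthusiastic user would top a simple mean-score ranking, while the Colley approach leaves it close to 1/2 until more comparisons accumulate.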

Other studies demonstrate how the validity of sustainability ratings and certifications can be assessed by comparing them to other related metrics. For example, Noushi Rahman at Pace University and Corinne Post at Lehigh University compare KLD’s environmental ratings to their own proposed measure of environmental corporate social responsibility to confirm the validity of their new metric.53 In earlier work, Aaron Chatterji and his colleagues test both the retrospective and predictive validity of KLD’s environmental ratings by comparing them to corporate data on environmental fines and toxic releases.54 They find that companies with high environmental concern scores generally had poor environmental performance in the past and went on to produce more pollution and have more compliance violations in the future. However, no significant relationship exists between KLD’s environmental strength score and future environmental performance data.

As I discuss in chapter 6, showing that information-based environmental strategies have a positive effect on such environmental outcomes is one of the key measures of their effectiveness. When trying to show such an effect, it is important to establish that the relationship is causal and not (p.133) just a random correlation. In an innovative study using propensity score matching to demonstrate such causality, Resources for the Future researchers Allen Blackman and Maria Naranjo test the validity of organic coffee certifications by comparing the environmental impacts of Costa Rican farms that received organic certification with similar farms that did not.55 Given that the only characteristic that systematically differs between the two sets of farms is their organic certification, they find that the certification does indeed improve the environmental performance of coffee production, both in terms of reducing chemical inputs and increasing use of lower-impact management practices.
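Blackman and Naranjo’s actual estimation is considerably more sophisticated, but the core idea of propensity score matching—compare each certified unit to the uncertified unit most similar to it—can be sketched in a few lines. All of the data below (scores, outcomes, treatment flags) are hypothetical illustrations, not their results:

```python
import numpy as np

def att_psm(ps, outcome, treated):
    """Average treatment effect on the treated via 1-nearest-neighbor matching.

    ps: estimated propensity scores (probability of being certified),
    outcome: observed environmental performance, treated: 1 = certified.
    Each certified unit is matched to the uncertified unit with the
    closest propensity score, and the outcome differences are averaged.
    """
    ps, outcome, treated = map(np.asarray, (ps, outcome, treated))
    t_idx = np.flatnonzero(treated == 1)
    c_idx = np.flatnonzero(treated == 0)
    effects = []
    for i in t_idx:
        j = c_idx[np.argmin(np.abs(ps[c_idx] - ps[i]))]  # nearest uncertified match
        effects.append(outcome[i] - outcome[j])
    return float(np.mean(effects))

# Hypothetical farms: higher outcome = better environmental performance index
ps      = [0.8, 0.6, 0.3, 0.7, 0.5, 0.2]
outcome = [9.0, 8.0, 5.0, 7.5, 6.5, 4.0]
treated = [1,   1,   0,   0,   0,   0]
print(att_psm(ps, outcome, treated))  # prints 1.0
```

The logic mirrors the study’s design: because matched farms are similar in their likelihood of seeking certification, the remaining outcome gap is more plausibly attributable to the certification itself rather than to preexisting differences.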

Programs can also use less statistical mechanisms, such as peer reviews and audits, to enhance both the validity and reliability of their information. They can encourage these processes to incorporate local expertise and have some level of regional specificity to them, as Rainforest Alliance does. However, this point raises an important trade-off for these metrics. The more locally—and regionally—oriented a certification or rating is, the more it risks losing its global comparability and reliability. Other trade-offs exist as well—between cost and quality, between depth and breadth, between validity and reliability, among the mix of input, output, and outcome metrics, between the need to update criteria and the importance of maintaining consistent metrics over time, and between the value of transparency and the risk of companies attempting to game the system. The validity-reliability trade-off is particularly important—a battery of questions that a company can easily answer may be highly reliable, but the questions may not get at the more complex but critical aspects of sustainability.56

These trade-offs are admittedly challenging to manage. The first step is to identify and be honest about them, both internally and externally. The next step is to evaluate them in the context of the evaluation taking place—what does the current landscape of similar ratings look like? Is a more replicable, valid, or reliable metric necessary? Can a strong case for the need for higher-quality information be made? Who are the key stakeholders, and what are they demanding? What are they willing to pay for? The final step is to decide how to manage those trade-offs, and develop the highest-quality metrics that the particular context calls for and allows. While CoValence’s approach of treating all of its data sources as equally reliable counts as one of the most problematic practices in terms of its basic validity, CoValence at least is being open about how it has managed this (p.134) particular reliability/validity trade-off. For CoValence, the validity benefits of evaluating the reliability of its sources do not warrant the associated costs and reliability concerns with such source reliability assessments. It is then up to the public and stakeholder groups to evaluate how these ratings have managed these trade-offs, and provide grants of legitimacy—as discussed in chapter 3—to those they evaluate as providing the level of information quality they demand.

The Quality of Green Building Information

So how does this help Lynn in her evaluation of green building certifications? Armed with the framework presented in this chapter, she can use it to compare the different programs. First, she can assess their replicability by analyzing their transparency. All four of the main green certification programs for commercial buildings in the United States—LEED, Green Globes, Living Building Challenge, and ENERGY STAR—are relatively transparent about their criteria, methods, and data sources. However, the transparency of their data is more limited and varied. None of the programs publicly reveal the underlying data upon which the certifications of individual buildings are based. LEED provides a scorecard of all of the individual points that each of its certified buildings has earned, while ENERGY STAR publicizes the 1–100 ENERGY STAR score that each of its certified buildings has received.57 Green Globes only provides the date of certification and the number of globes earned by each of its certified buildings, while the Living Building Challenge only provides its buildings’ certification status and type (NetZero Energy, Petal, Full Living, and Living Community Challenge).58

The replicability of all of these programs is therefore limited, although they have made a commendable effort to publicize their methods and criteria. Their reliability and validity, however, are much more uncertain. The Living Building Challenge employs a technical director who is responsible for overseeing “certification and technical consistency,”59 and LEED has Technical Advisory Groups that review the consistency and technical rigor of credits and prerequisites.60 Reliability and validity are mentioned in an ad hoc fashion in many of LEED’s point interpretation sections. ENERGY STAR emphasizes the importance of consistency in energy benchmarking exercises and its requirement that all types and amounts of energy used by (p.135) the property must be documented for twelve consecutive months.61 Green Globes asserts that it “sets the standard for accuracy, consistency, and credibility,” and claims its lack of prerequisites, recognition of nonapplicable criteria, and incorporation of partial credit “results in the highest possible accuracy of the final Green Globes score and rating.”62 However, the organization provides no evidence on its website to support these claims; indeed, none of these initiatives provide any evidence of the overall validity or reliability of their certifications.

The four programs differ significantly in their practices regarding data updates. Green Globes has no requirements for recertification, while only one of the LEED certifications (for Operations and Maintenance) requires buildings be recertified every five years.63 The Living Building Challenge conducts its final audit at least twelve months after construction has been completed, but has no further updating requirements.64 ENERGY STAR requires annual recertification, although the logo displayed on certified buildings does not indicate the years certified.65 In contrast to the failure of these programs to provide up-to-date data about the buildings they have certified, all of the programs have updated their evaluation criteria relatively recently—LEED 4.0 in 2013, ENERGY STAR for New Homes in 2010, Living Building Challenge 3.0 in 2014, and Green Globes New Construction v.2 in 2013.66

Insights from Academic, Government, and Other Sources

Regardless of these recent updates, however, these programs are not providing much, if any, persuasive evidence that they are providing high-quality information about the sustainability of the buildings they are certifying. While recognizing the updating requirement of ENERGY STAR and the data transparency of both ENERGY STAR and LEED, Lynn turns to other sources that have attempted to evaluate and compare the validity of the different programs as metrics of building sustainability. Several online articles compare Green Globes and LEED, and conclude that the former is more user friendly, while the latter is more stringent.67 BuildingGreen’s Tristan Roberts and Paula Melton find that LEED is stronger on site sustainability, energy performance and renewables, water efficiency, materials and resources, ventilation, daylighting, product emissions, regional bonus points, and rewards for innovation.68 In a review of all four programs, Sustainable Performance Solutions’ Lawrence Clark concludes that the Living (p.136) Building Challenge is “the built environment’s most rigorous performance standard,” and concurs with others that LEED and Green Globes are about doing less harm while the LBC is about doing good.69

A few scholars have assessed competing green building certifications with regard to their coverage of a particular environmental performance area, such as indoor air quality or energy efficiency. For example, Wenjuan Wei and her colleagues at the University of Paris examine the extent to which different aspects of indoor air quality (IAQ) are measured by thirty-one different green building certifications, and find that IAQ metrics make up between 8.2 percent and 9.1 percent of LEED’s rating system.70 Other researchers have focused on establishing the validity of LEED in particular. Oberlin College’s John Scofield found that a sample of LEED-certified buildings in New York City had similar greenhouse gas emissions and energy costs to comparable noncertified buildings.71 More specifically, gold-certified buildings had lower emissions and costs while silver-certified and certified buildings had higher emissions than comparable buildings. UC Santa Barbara’s Sangwon Suh and his colleagues found that the difference in life-cycle environmental impacts of LEED-certified vs. noncertified buildings can range from 0 to 25 percent, depending on which LEED points are earned.72 They estimate that the most significant impact reductions are in the categories of acidification and human respiratory health, and the occupancy phase of buildings accounts for the vast majority of the environmental impacts in ten of the twelve impact categories.

While these energy and life cycle studies are revealing, they do not compare the different certifications and thus have limited value in evaluating their relative validity. The data on energy use by LEED-certified buildings are concerning, but I could find no similar studies for Green Globes and so that certification program could face a similar challenge to its validity. This is less of an issue for the Living Building Challenge, which requires net-positive renewable energy use and thus should by definition have improved energy performance, and ENERGY STAR, which requires certified buildings to demonstrate a reduced amount of energy use. However, the lack of environmental performance tracking over time by LEED, LBC, and Green Globes also reduces their validity as metrics of environmental impact.

In its regular review of green building certifications, the U.S. General Services Administration (GSA) systematically compares the criteria and (p.137) methods of Green Globes, LEED, and the Living Building Challenge against the minimum sustainability requirements for federal buildings. The Energy Independence and Security Act of 2007 (EISA) requires that the GSA identify a green building certification system that it “deems to be most likely to encourage a comprehensive and environmentally sound approach to certification of green buildings.”73 The GSA’s 2012 report finds that none of the three systems cover all of the federal requirements: Green Globes aligns with the most requirements at some level (twenty-five out of twenty-seven) while the Living Building Challenge meets the most outright (twelve out of twenty-seven).74 In 2013, the GSA recommended that agencies use either LEED or Green Globes, but not the LBC.75

The GSA’s technical advisory group, while recognizing LBC’s laudable focus on sustainability performance as opposed to technical standards, had concluded that it “did not align well with Federal requirements [because] it does not specify how to meet its performance requirements.”76 This decision not to recommend the Living Building Challenge reflects the command-and-control nature of the federal standards. If a building demonstrably achieves net-zero energy use, for example, who cares if it did so following the federal government’s precise requirements? Granted, LBC does not cover some substantive federal criteria (as LEED and Green Globes also do not), but those it does address are covered more deeply than in any of the other systems.

The federal evaluation criteria also focus on whether the certification was developed through an inclusive, moderated, and consensus-based process.77 The Living Building Challenge was developed by the International Living Building Institute to define “the most advanced measure of sustainability in the built environment possible” and it did not systematically include all stakeholders in this process.78 As discussed in chapter 3, this focus on inclusivity may not be important to all audiences (and may actually undermine the development of cutting-edge initiatives), and it is inappropriate for the federal government to use it to evaluate validity and exclude a particular certification system.

While the federal guidelines are designed to identify comprehensive certifications, they do not focus on life cycle analysis as a means to encourage such comprehensiveness. Green Globes, LEED, and LBC in this context are more valid than the federal guidelines, as they each incorporate LCA into their design and criteria. While further progress can be made, these efforts (p.138) represent the most innovative practices in this area for ratings and certifications more generally. LEED grants three points for building projects that “conduct a life cycle assessment of the project’s structure and enclosure that demonstrates a minimum of 10 percent reduction, compared with a baseline building, in at least three of the six impact categories listed …, one of which must be global warming potential.”79 Green Globes grants 33 points for building projects that use an approved LCA impact estimation method “to evaluate a minimum of two different core and shell designs” that results “in selection of the building core and shell with the least anticipated environmental impact.”80

LEED also grants up to six points and Green Globes grants up to twenty points to projects that use products with environmental product declarations that incorporate life cycle assessments. Furthermore, LEED used life cycle analysis in determining the weights of the different criteria and the allocation of the 100 points in the system.81 Knowing this, Lynn may be less concerned about architects going for the easiest points and LEED’s credits for building components such as bike racks and alternative vehicle parking spaces. If points are weighted appropriately, then there is nothing wrong with architects choosing the most appropriate credits for their situation (just as football coaches must choose between field goals and touchdowns). It is unclear from its website how Green Globes determined its criteria weights, while the Living Building Challenge is an “all or nothing” approach that does not use weighted criteria.82 Two comparative analyses show that Green Globes places more weight on energy-related criteria, while LEED has a more even distribution of weights across different impact categories. Nevertheless, both studies conclude that their category weights are generally similar.83

Connecting Information Quality Considerations with Institutional Interests

Thus from this review of the literature, Lynn decides that all four programs leave something to be desired in terms of their information quality. None of them provide any evidence of their reliability or validity. However, they meet at least a basic standard of transparency, and compared to certifications and ratings in other domains, the quality of their documentation and explanation of their methods is relatively high. Given this relative parity in information quality, what should Lynn recommend? Should her university (p.139) fall back on cost as the determining factor, as some of her colleagues are suggesting? Cost is indeed an important factor, and as mentioned earlier, may be inversely related to the quality and validity of certifications. The relative costs of these certifications are complex and difficult to compare, as they involve the direct application costs as well as the indirect costs associated with consultants and necessary architectural changes.

In an effort to take this complexity into account, Jeffrey Beard concludes that Green Globes is significantly faster and less expensive than LEED.84 Similarly, a study of two comparable dormitories built on the University of North Carolina, Charlotte campus—one certified by LEED and the other by Green Globes—reveals that LEED’s direct costs were $640 less than those for Green Globes.85 However, the architecture and engineering service costs required for LEED were four times as much as those for Green Globes, making the total cost of LEED certification nearly $42,360 more expensive.

It is important to distinguish between the potential causes of this price difference. It may be due to differences in validity, reliability, or operational efficiency, but it may also be due to differences in the rigor and difficulty of the different standards. Two certifications may be equally valid, but one may have a higher standard than the other. This distinction relates to the relationship between standards and effectiveness. A label that has a relatively low standard that certifies 50 percent of a market may create more environmental benefits than one that has a relatively high standard that certifies only 10 percent of a market. This relationship is a complex one that we will return to in chapter 6, but for now the point is to distinguish between the quality of a certification’s metrics and the level of performance required on those metrics. This distinction points to a broader point about the potential value of having ratings that recognize different levels of performance, and this is the logic underlying the multi-tiered approaches of LEED and Green Globes (i.e., silver to platinum, 1–4 Globes). It is also a justification for the existence of multiple and competing certification systems.

So which certification Lynn recommends will depend as much on her evaluation of their quality as on her sense of the importance of her university reaching a high level of sustainability performance. This relates back to chapter 2 and to what extent these certification programs relate to and activate Lynn’s values and the values of her institution. It will also depend on its available resources; budget constraints may limit its choices even if (p.140) a strong commitment to sustainability exists. In this sense, Lynn’s answer may be a series of questions directed back to her colleagues and the donors funding the new building’s construction about these different factors.

Given their answers to these questions, if Lynn’s university has a broad-based and deep commitment to environmental performance and confidence they can raise the money to pay for the extra up-front costs (some if not all of those costs will be recovered over time), she might recommend they go with the Living Building Challenge. If her university highly values methodological rigor and clarity, financial savings, and greenhouse gas and air pollution reductions but has more limited resources and a less holistic commitment to sustainability, then she might suggest ENERGY STAR. If it has an intermediate level of commitment to holistic environmental performance and a moderate level of available resources, LEED may be the option she should propose. However, if its resources are more limited and its commitment to sustainability is growing beyond energy concerns but is still nascent, then she might recommend Green Globes, particularly if the alternative is no certification at all. Some of her colleagues have proposed that they avoid the costs of certification altogether and do their own sustainability branding for the building. Equipped with this chapter’s framework, Lynn should remind them that they will need to establish the replicability, reliability, and validity of these self-reported claims of sustainability, just as these different certifications should be doing.

On that note, the bottom line for these building certification programs—and information-based governance strategies in other domains as well—is that they need to establish the level of information quality they are providing. The challenges they face are embodied in the only general statement about reliability and accuracy found on the LEED website, which, similar to the other cases cited earlier, is a disclaimer of responsibility.86 These initiatives need to move beyond such legalese and take responsibility for the quality of the information they are providing. These four cases are doing relatively well in terms of their transparency, but can improve significantly in other areas. Their audiences need to recognize the progress they have made, which has likely been motivated by their competition with each other, but also push them to continue to make improvements. Quality is unlike the “fairness” that the Queen and her Magic Mirror are so focused on, as it can be improved with intentionality and persistence. At whatever level of performance they focus on, these programs can improve and better (p.141) document the accuracy and consistency of their claims. They should conduct, cite, or fund studies that demonstrate their validity, reliability, and replicability, and use their results in their marketing and outreach as a point of competitive advantage.

So after examining their options closely and thoughtfully deliberating about their priorities, Lynn and her colleagues decided that as representatives of an institution that prides itself on being on the cutting edge of innovation and performance, they wanted to lead by example and have the new building certified by the Living Building Challenge. The value of having a building that is certified as having “net positive” energy and water use and is actually environmentally regenerative was particularly compelling for them. They were also attracted to the strong validity of LBC’s life cycle focus and its requirement that the building’s environmental performance be measured twelve months after construction. While they acknowledged the strengths of the other programs, they felt that LBC was the most holistic and comprehensive of the competing options and best matched their organization’s values and sustainability aspirations. A certification from the LBC, in their eyes, represents the most valid and highest-quality information an institution can provide to the public about the environmental performance of its buildings.

With all of this focus on quality, it is important not to miss another key insight from the world of building certifications. One of the key advantages that commentators have emphasized about Green Globes is its greater ease of use. A certification can be the most trustworthy, salient, and valid program in the world, but if it is too difficult for anyone to figure out and appropriately utilize, it will inevitably fail. LEED has been responding to its critics in this regard, and its LEED Dynamic Plaque is one of the more exciting and engaging forms of certification to be developed in recent years.87 Designers of information-based governance strategies must make the information they are producing and disseminating attractive, intelligible, and usable for their different audiences, from consumers to architects to policy-makers. At the end of the day and even after Lynn’s in-depth analysis, the certification she and her colleagues selected was likely the one they most understood and were the most excited about using. For this reason, the usability of the information provided by these initiatives is the subject of the next chapter. (p.142)


(20.) An exception is the work of Auld and Gulbrandsen (“Transparency in Nonstate Certification,” 2010), which distinguishes between input and output transparency.

(22.) I use both representation and metric here because theoretically such operationalizations do not necessarily have to be numeric or even textual, but can be abstract and artistic as well. For the purposes of this chapter, however, I am focused primarily on the validity of sustainability metrics, and so will generally use that term.

(29.) The one-sided Fisher’s exact tests for these two results were 0.003 and 0.087, respectively.

(30.) This excludes references to the reliability or consistency of other programs or sources of information.

(31.) Full disclosure: I am a cofounder of GoodGuide and served as its first director of content development.

(33.) The Oxford English Dictionary provides two definitions of “reliability”—the first is focused on trustworthiness and confidence and maps to the more colloquial usage, and the second reflects a more statistical understanding of it as “the degree to which repeated measurements of the same subject under identical conditions yield consistent results,” which is the focus of this chapter (“Reliability, N.”).

(52.) Ibid. This is just one example of the insights available from the growing literature on the science and mathematics of ratings and rankings. While the technical details of this literature are beyond the scope of this chapter, an excellent starting point for further exploration is Langville and Meyer, Who’s #1?