Nick Diakopoulos is an assistant professor in communication studies and computer science at Northwestern University, where he’s also the director of the Computational Journalism Lab. In addition, Diakopoulos is the author of Automating the News: How Algorithms Are Rewriting the Media. Recently, it was announced that he is joining Jeremy Bowers’ team at The Washington Post to develop a computational political journalism R&D lab to help the Post’s election coverage. Diakopoulos joins It’s All Journalism host Michael O’Connell to discuss how algorithms are changing the way journalists are doing their jobs.
It’s All Journalism: When I get a guest on the podcast who’s sort of coming from a mix of disciplines, I usually like to ask which came first: journalism or computers or computer science?
Nick Diakopoulos: It’s such an interesting question. I mean, you know, I’ve been coding since I was probably about six years old. I was writing computer programs on the old Commodore 64 in BASIC, but I also grew up in a household where my father was a journalist for a long time. So I guess I was exposed to both at a pretty young age. I studied computer engineering in college and then moved into computer science for my Ph.D. So that’s really where my formal training comes in. But I was certainly exposed to journalism at a young age and worked at a magazine in high school and that kind of thing. So tying the two together kind of made a lot of sense for me.
IAJ: What inspired you to write this book about automating the news?
ND: When I was in grad school back at Georgia Tech, this would’ve been in 2006, I really started discussing with my advisor there what it would mean to kind of smash together computing and journalism, and what that would look like. For the last, I guess, 13 years, I’ve been sort of thinking about the intersection of those fields pretty deeply. And around 2016, it just all started to come together and it felt like the right time to start working on really kind of writing it down. What did it mean to do journalism computationally? What does it mean for journalism and what did it mean for computer science? I was really kind of inspired to take some of the applied research that I’ve been thinking about over the years and frame it in a way that’s accessible to practitioners as well. So I try to translate some of the results and findings in the research in a way that they can be useful for reporters and editors.
IAJ: OK. When we talk about computational journalism, what does that encompass?
ND: So in the book, I really kind of cover the whole gamut of what I think it encompasses. So that includes everything from data mining and machine learning to find stories in large data sets to automatically producing written articles or data visualizations, interactively generating content in the form of news bots or chat bots. But also things like how algorithms are used in news distribution. So if you look at something like a search engine or a feed or recommender system, how do those algorithms play into the way that people are exposed to news? And then finally I also think about and talk about the broader issue of algorithms in society and how journalists can start investigating and scrutinizing algorithms that are used now throughout government and throughout the private sector to contribute to sort of decision-making processes. So when I talk about algorithms in journalism and computational journalism, I’m really kind of talking about all of those things. It is sort of a widely scoped field.
IAJ: What are some of the ways that algorithms are rewriting the media, changing the way that we cover the news and the way we report and present it?
ND: Yeah. So, I think algorithms are changing all kinds of different facets of how the media is produced. Again, you know, from how the stories are found, using algorithms to identify interesting patterns or happenings in datasets, to the way stories are written. What I focus on in the book is how algorithms are changing news production, and in particular how people need to work together with algorithms and together with machine learning systems or data mining systems in kind of what I call hybridized workflows. So what are the things that algorithms do well and what are the things that people do well, and then how do you kind of marry those together into a productive work process, so that you get more than the sum of the pieces there?
IAJ: A lot of people, when they think about algorithms and media, they think about negative things like bots that are spreading fake news or trying to get your information. But I take it, you talk a lot about the benefits of what this type of computational journalism can actually provide and less about the downside or the negative side. Is that correct?
ND: Well, I’m an optimist. I’m trying to be optimistic about how this technology can really be useful for enhancing news production. So, I mean, every technology has some good and some bad, and algorithmic approaches to news production are sort of no different. I mean, on the plus side, I think news automation can help produce news more quickly. It can produce news on a wider scale or wider breadth of material. It can be used to personalize content to make it more relevant to individuals. And it can also help produce higher quality journalism. So, you know, when we talk about investigative journalism, the ability to use data mining and algorithmic approaches to dig through some of these large document corpora, that’s a real advantage and it allows journalists to discover more unique and original stories, and do sort of a more comprehensive job investigating these things than they otherwise could without algorithmic approaches. So that’s kind of on the plus side of the ledger.
Of course, there are also disadvantages to automated news production. Certainly, these techniques require a lot of data. They require clean data. And so if you’re working in a domain where data is not sort of available or if it’s often dirty data and requires a lot of cleaning, then that’s going to limit your use of automation. Sort of a broader point there is that if information is not quantified, if there’s some important piece of context for a story that just hasn’t been digitized or hasn’t been quantified, that’s not going to be accessible to an algorithm.
And I would say another big issue with algorithmic approaches and AI in general is that they tend to be very brittle. Typically, these systems are engineered to work in a narrow domain, where it’s been sort of very carefully thought through and engineered. And that really ends up being a pretty substantial limitation, right? Because in the world of news, things are always changing. The world is a dynamic place. There’s a lot of different things that can go wrong in the world. And if you’re just using a sort of narrowly engineered piece of software, that could easily break if something out of bounds kind of happens in the news. So obviously, there are advantages and disadvantages to automated production technologies.
IAJ: Are there any projects or newsrooms that you’ve seen out there that you think are doing cutting edge or exceptional work with computational journalism?
ND: Yeah, absolutely. There’s a variety of examples in the book. One of the ones that I really like to point out is a story from a couple of years ago that was produced at the Atlanta Journal-Constitution newspaper, where they used a machine learning technique to help identify doctors in a large document dump. So they collected like 100,000 documents from various medical boards’ websites online. And then they used this machine learning technique to identify documents in that corpus that pointed to doctors that may have been involved in sexual misconduct cases. And, of course, this was a months-long effort. It certainly wasn’t the case that the machine learning acted on its own. I mean, there were some very smart journalists involved who built the system, but then also had to chase the leads that the machine learning system produced. So, they took 100,000 documents down to maybe 6,000, but then they still had to read through those documents and do all the reporting and look into these doctors’ backgrounds and so on, in order just to kind of finish that investigation. So that’s one example that I like to point out as sort of a success of using data mining to find a story.
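The triage step described above, cutting a large pile of documents down to a smaller set of leads for reporters to read, can be sketched roughly as follows. This is a minimal illustration, not the Atlanta Journal-Constitution’s actual method: the interview says a machine learning technique was used, whereas this sketch swaps in a simple hand-weighted keyword score as a stand-in for a trained model. The `KEYWORD_WEIGHTS` table, the `Document` type, the `score` and `triage` functions and the threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical keyword weights standing in for what a trained model would learn.
KEYWORD_WEIGHTS = {"sexual": 3.0, "misconduct": 2.0, "boundary": 1.5, "patient": 1.0}


@dataclass
class Document:
    doc_id: str
    text: str


def score(doc):
    """Sum the weights of known keywords in the document text."""
    words = (w.strip(".,;:") for w in doc.text.lower().split())
    return sum(KEYWORD_WEIGHTS.get(w, 0.0) for w in words)


def triage(docs, threshold):
    """Return documents scoring at or above the threshold, highest score first.

    The result is a reading list of leads for reporters, not a set of findings.
    """
    scored = [(score(d), d) for d in docs]
    leads = [pair for pair in scored if pair[0] >= threshold]
    leads.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in leads]


if __name__ == "__main__":
    corpus = [
        Document("a", "License renewed on schedule."),
        Document("b", "Board found sexual misconduct with a patient."),
        Document("c", "Patient complaint about billing."),
    ]
    for doc in triage(corpus, threshold=3.0):
        print(doc.doc_id, score(doc))
```

Whatever the scoring method, the shape of the workflow is the same as in the interview: the software narrows the pile, and people still read every surviving document and do the reporting.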
IAJ: It’s interesting. Just recently I was working with someone and it was just something as simple as she wanted to create a Google map. And then, as we were building this map and taking advantage of the data that’s part of the whole Google system, she realized that she was putting together just a presentation about something that she was saying, but then all the data that Google had sort of came up into that. And so that actually made this little presentation that she was doing much richer in content and much more useful to the reader. So, you know, I know many journalists who are always looking for ways to enhance their stories, sometimes in a very little way, but other times, sort of what you’re pointing to, this opportunity to dig deeper and find stories in the data, to tell interesting stories in the data. There are just lots of opportunities, I think, to do that. What would you say to somebody who wants to sort of move their career more toward being, I guess, a computational journalist, somebody who writes more data-rich stories, maybe uses different types of algorithms or programs to assist them in their reporting?
ND: I think this gets at kind of how the world is changing and how people need to kind of adapt and retrain themselves to be able to use all this technology that’s emerging. I think sort of the next generation of journalists, or this generation of journalists that wants to keep up, they’re going to need to learn how to be able to update and tweak and validate and supervise these automated and algorithmic systems. If we’re talking about a journalist who wants to use an automated writing tool, they might need to learn some data skills like how to work with structured data, how to work with formal logic, like “if then else” statements on that data. I think some fluency with the basics of computer programming would probably be quite helpful for a lot of people. I’m not saying that everyone needs to be a software engineer, but we’re talking about some basics. You know, understanding how computer programs work, being able to write some simple computer programs, I think goes a long way in helping someone think computationally. And what I mean by thinking computationally is being able to think in a way that you can set up a problem that you have in your reporting in a way that you can get the computer to help you with that problem. I don’t mean that people would actually think like computers, but it’s more thinking about how do I break this problem down in a way that I can get a computer to do a piece of this problem for me.
Those are some of the things that people can kind of learn and adapt to. I mean, another aspect of this would be kind of at the high end, learning advanced statistical techniques, machine learning techniques. Again, I wouldn’t claim that every journalist needs to learn this or needs to know about this. But I think if you really want to kind of innovate or try entirely new techniques or adapt some of the techniques that are being used in data science now, then I think developing some of these advanced data science skills could open up new avenues for journalists.
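To make the “if then else” on structured data idea concrete, here is a minimal sketch of the kind of rule an automated writing tool might apply. It is only an illustration under stated assumptions: the record fields (`home_team`, `away_team`, `home_score`, `away_score`), the team names, the verb thresholds and the `describe_result` function are all hypothetical and do not come from any particular newsroom tool.

```python
def describe_result(game):
    """Turn one structured game record into a one-sentence summary."""
    home, away = game["home_team"], game["away_team"]
    home_score, away_score = game["home_score"], game["away_score"]

    # First decision: who won, or was it a tie?
    if home_score > away_score:
        winner, loser = home, away
        high, low = home_score, away_score
    elif away_score > home_score:
        winner, loser = away, home
        high, low = away_score, home_score
    else:
        return f"{home} and {away} tied {home_score}-{away_score}."

    # Second decision: pick a verb based on the margin (thresholds are arbitrary).
    margin = high - low
    if margin >= 10:
        verb = "routed"
    elif margin <= 2:
        verb = "edged"
    else:
        verb = "beat"

    return f"{winner} {verb} {loser} {high}-{low}."


if __name__ == "__main__":
    record = {"home_team": "Hawks", "away_team": "Owls",
              "home_score": 5, "away_score": 4}
    print(describe_result(record))
```

The point is less the code than the habit it represents: deciding which fields matter, which conditions to check and what wording each condition should trigger is the piece of the problem handed to the computer.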
IAJ: Now I know in the past, having talked to lots of different journalists about their day-to-day working environment, often they find themselves in the situation that, “Hey, I want to do this type of story. I want to use this new type of technology or something.” And it usually comes back onto them to sort of bring it to the newsroom. What can newsrooms do to sort of create an environment where the reporter doesn’t have to know everything? I mean, certainly it’s helpful to know how certain types of systems work and how to report and provide data to help support it. But what can newsrooms do to create that environment where they can, across the team, pursue this type of reporting?
ND: A lot of what I’ve seen is that these computational journalism projects oftentimes do happen in teams and you have maybe someone who is more expert in data science, paired up with someone who’s more expert in reporting techniques, and then maybe you even have a product person there or a project manager to kind of tie things together. These computational techniques, they kind of demand that you bring together different skills on a team. And so I think, in terms of how to set that up in a newsroom, having cross-functional teams that come together in order to tackle particular projects is an interesting way to set things up. I think dropping some data scientists in a newsroom could kind of be interesting in the sense that those data scientists might see the typical reporting projects from a slightly different angle or in a different light. And they might have different ideas about how to approach those types of reporting projects. So that’s something to try, and generally kind of creating some collisions between different skillsets and different ways of approaching problems, I think, can be a productive way of making progress here.
IAJ: I interviewed a couple of computational journalists from The Washington Post last year about this special project that they had done. And it was a whole team of people, a whole team of reporters, a team of tech people who were basically taking these large data sets about crime rates across the country and sort of putting them into a system where they could search for certain types of information and lay them across the maps and then build ways to present them. And it was definitely a team effort. And then when you got to the end of it and the presentation was there, you saw the richness of the whole project. But that’s something that they worked on for a period of time. But anyway, speaking of The Washington Post, it was recently announced that you’re going to be joining the Post’s new computational political journalism R&D lab. So what are your aspirations for that position and what is the lab going to be doing in general?
ND: I’m really, really excited about this. I have a sabbatical coming up this fall, and when I started chatting with the Post, Jeremy Bowers there got excited about the potential to work together. So, I’m going to be joining up with him and working on this computational political journalism R&D team. And really what I’m hoping there is that we can start to catalyze some advances in the quality and the scope of elections coverage that the Post can produce next year. And I come from kind of an applied research background, and I think bringing that perspective into dialogue with folks in the newsroom and folks from engineering teams at the Post is going to be really, really productive. We’re already kicking around a bunch of different ideas to think through how we might capitalize on automated and algorithmic news production, how to augment the capacities of journalists and reporters there and how to develop new, unique experiences for readers. And I’m also kind of thinking about how computational techniques are changing politics more broadly or how they could change coverage more broadly. You know, how’s this going to change the way reporters and editors need to cover elections, in terms of, say, the way that algorithms are used to promote different perspectives in the media or the way in which bots are used to push different ideas and so on? So I’m really kind of looking forward to exploring a lot of these ideas and collaborating with different editorial folks in advancing some of these ideas.
IAJ: I know we talked a little bit earlier about bots that are used to sort of spread fake news or to gather information, personal information and things like that. One of the things that sort of separates journalists from somebody just posting information online is that we have sort of these ethical standards that we apply to how we cover a story, how we report something and how we present it. How do we sort of maintain these standards in a computational environment? Are they mutually exclusive? Is it something, do we have to change our approach to journalism or do they just work together really well?
ND: There are some new challenges that computational approaches bring up for the ethics of journalism. I think we’re still in the pretty early stages. Ethics is, you know, something that gets negotiated over time as people think through where to draw the lines of what is kind of appropriate versus inappropriate. I can sketch out a few of the areas that I think are kind of interesting ethically through this lens. So, you know, take for instance, machine learning. Machine learning, you know, can be used in a data mining context to identify an individual that is maybe newsworthy for some story. Maybe it’s some kind of statistical evidence about that individual, but with machine learning there’s always going to be some statistical uncertainty based on the method. And that means that journalists need to understand the ethical issues of using uncertain statistical evidence, with respect to the kinds of claims that they want to make in their stories. So if you want to make a public indictment of malfeasance, that maybe requires a higher standard of evidence than a data mining method can provide. And so, you know, that will in turn kind of mean that journalists need to do additional reporting on the leads that are data mined.
Another example would be sort of the use of predictive modeling in journalism. So this is the kind of stuff that we see from election models that FiveThirtyEight and The New York Times publish. There’s an interesting ethical question there that kind of gets to the issue of feedback loops and the potential for a public prediction to then impact social behavior and social reactions in some way. So, is it ethical for journalists to publish predictions on voting day that could impact voter turnout? I don’t know. I mean, that’s something that maybe we need to think about some more.
Finally, I think there’s sort of the question of transparency, when journalists use machine learning techniques or other types of opaque algorithmic methods in their journalistic process. So, how can you be transparent with these types of tools? I mean, I think that transparency is an ethical goal that a lot of journalists share. But the question is, well, how can you be as transparent as possible when you’re using these types of techniques? Those are the ethical issues that come up.
And another ethical issue that comes up is kind of the labeling of automatically generated content. The question is, if you generate content automatically, does the public have a right to be notified that the content was automatically generated? In the same way that you would byline an article to signal who is responsible for writing that article, should we also byline articles if they’re automatically produced? I think there are some emerging approaches there. I think most of the major outlets like Reuters and Bloomberg and AP that routinely use automatically generated articles, they’ve sort of arrived at wanting to label that content as such. But I don’t think that there has been a broader ethical conversation about the overall standards there. And that might be something for industry to kind of think about some more.
IAJ: When we’re talking about automatically generated articles, I think AP, like last year, they were experimenting with certain sports stories that are like super basic. Here’s the box score, here’s who won and here’s who lost. Rather than have a reporter spend time on that, generate those. So is that what you’re sort of talking about, stories where all the elements are pulled together by an algorithm and not so much with reporting?
ND: Yeah, that’s right. So, if an algorithm has kind of produced a story from end to end, shouldn’t we also label that as such, so that the audience understands where that content came from?
IAJ: So this is all really fascinating, and I appreciate you coming in to talk about it. Looking forward in the next five years, how do you see the new newsroom changing? How do you see computational journalism changing our industry?
ND: I think there are some really exciting things to sketch out about where I see this all heading. One is I think there are some opportunities for news organizations to start developing their own algorithms and tools and technologies that better reflect editorial values that many journalists have. So, for instance, The Washington Post has been developing machine learning techniques to help them moderate comments on the site. By developing the technology on their own, they’ve sort of better been able to reflect editorial standards of how to moderate those comments. So they’re not just using some generic algorithm to do it. It’s really kind of a journalistic algorithm. I think this is a really interesting direction for journalists to think about. What are the journalistic values that they want to build into this algorithmic, automated AI future of media that’s emerging? Or, are they just going to be content to pull off the shelf some generic technology that Google gives them, or that some other big tech company gives them? That’s one area.
I think another interesting area is just thinking more about this whole hybridization of work. How are journalists going to work effectively with algorithms and AI and what are the jobs of tomorrow going to look like? I think they’re going to be more and more hybridized, but I think, as managers and designers of this new hybrid labor, we also need to consider the very human concerns of what that work looks like. Is it good for the journalists who are doing that work? We want to make sure that journalists don’t just become cogs in some bigger technological process where the algorithm is your boss and the algorithm told you, you have to call this person and get this piece of information. We want to avoid that. But I think it really kind of boils down to, how do you design the work so that it’s efficient but also satisfying for the people doing that work?