Episode 7 - NYC Citywide Public Data and Town & Gown Data Analytics

    00:00:18:09 - 00:00:49:00
    Jie Ren
    Hi, everyone. This is Professor Jie Ren. Welcome to my podcast, When Tech Meets Ed. Today I have a few guest speakers joining me to discuss a really interesting topic that is close to my heart: data analytics, in a very special form of collaboration between universities and government, Town+Gown. Throughout this whole process, students truly get to experience data analytics in a real-life context.

    00:00:49:02 - 00:01:01:21
    Jie Ren
    And then they get to provide data-driven solutions and insights for the government's operations, et cetera. So let's begin. Will, why don't you introduce yourself?

    00:01:01:23 - 00:01:28:14
    Will Hisao
    Okay. Hi. So, my name is Will Hisao. I have served as manager of data systems and solutions for the New York City Mayor's Office of Operations since 2015. My journey with this specific national project is amazing, and it actually began before my time at Operations, when I was at the New York City Department of Environmental Protection.

    00:01:28:16 - 00:01:53:01
    Will Hisao
    So that's how I started. Originally I started as a hands-on developer for this particular project. I was a technical person, and over the last decade I kind of evolved into managing the full data lifecycle, from collection to storage, management, processing, and visualization. I have stuck with this project for so long because I'm passionate about the intersection of IT and public policy.

    00:01:53:03 - 00:02:07:05
    Will Hisao
    I believe this specific data set holds the key to understanding how New York City builds its future, and I love solving the technical puzzles required to unlock the insights and the information for users.

    00:02:07:07 - 00:02:31:14
    Jie Ren
    Oh, wonderful. It is truly amazing that you, and also Terri, who will be another guest speaker for this episode, helped us tremendously. I'm so grateful for that, especially to you for explaining the definitions of the data, explaining the background of the entire data collection, and inspiring students in their analysis.

    00:02:31:16 - 00:02:44:22
    Jie Ren
    So I'm just curious: what was the motivation for the New York City government to make the data publicly available, for anyone to see as a dashboard?

    00:02:45:00 - 00:03:14:06
    Will Hisao
    Right. So, I think back in 2012, the motivation started from compliance and then moved to democratization. In 2012, we were required, for the deputy mayor, to publish New York City's capital projects, their budgets and schedules, to the public. At the time, we were required to report on projects over $25 million in size, because there are so many projects out there.

    00:03:14:06 - 00:03:38:05
    Will Hisao
    And so we started with the threshold of $25 million. We published the data. And then in 2020, there was a key turning point. Local Law 37 of 2020, passed by the City Council, required us to publish everything, all the integral parts of New York City. So that's how this project started. We made all of the data available for all the projects.

    00:03:38:07 - 00:03:43:01
    Will Hisao
    So that's the original motivation: compliance. Yeah.

    00:03:43:01 - 00:04:13:00
    Jie Ren
    Oh, nice. So, from my experience interacting with Town+Gown people, and also witnessing the students' learning process, the data sets are definitely complex and huge. So I can imagine, if I were in your shoes, there could be some data challenges. How did you handle these data challenges when you were putting the dashboard together and navigating through it?

    00:04:13:02 - 00:04:35:06
    Will Hisao
    So back in 2012, the biggest challenge at the time was, first, data silos. There was no unified model or location to put the data together. So first, data silos: at the time, every managing agency had its own system, and some agencies didn't need a system because they didn't have that many projects.

    00:04:35:08 - 00:05:08:20
    Will Hisao
    So we started from there. We worked with the agencies to build a data dictionary and define the spec, to make sure we have a universal standard. Then we centralized the location, and we collect data on a cadence of about three times a year. That's the first challenge. The second challenge is the data model, because in New York City a project's budget and schedule are entities that do not have a one-to-one relationship.

    00:05:08:22 - 00:05:26:00
    Will Hisao
    For example, one schedule may have five funding codes, which we call FMS IDs, and vice versa. You might have one schedule with five funding sources, or one funding source mapped to five different projects.

    00:05:26:06 - 00:05:27:10
    Jie Ren
    Like a many-to-many relationship.

    00:05:27:12 - 00:05:43:22
    Will Hisao
    Right. I think that is the fundamental data model that we implemented for this project, which is quite an accomplishment in terms of how to explain this data to the public.
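
    [Illustrative sketch, not part of the conversation. The many-to-many relationship Will describes, where one schedule can draw on several FMS IDs and one FMS ID can fund several schedules, is commonly represented with a link table of pairs. The identifiers below (SCH-1, FMS-A, and so on) are made up for illustration; the city's actual schema is not described in this episode.]

```python
from collections import defaultdict

# Hypothetical link rows: each pair says "this schedule is funded by this FMS ID".
links = [
    ("SCH-1", "FMS-A"),
    ("SCH-1", "FMS-B"),  # one schedule, two funding codes
    ("SCH-2", "FMS-B"),  # one funding code, several schedules
    ("SCH-3", "FMS-B"),
]

# Index the same pairs in both directions so either side can be looked up.
funding_by_schedule = defaultdict(set)
schedules_by_funding = defaultdict(set)
for schedule_id, fms_id in links:
    funding_by_schedule[schedule_id].add(fms_id)
    schedules_by_funding[fms_id].add(schedule_id)

print(sorted(funding_by_schedule["SCH-1"]))   # ['FMS-A', 'FMS-B']
print(sorted(schedules_by_funding["FMS-B"]))  # ['SCH-1', 'SCH-2', 'SCH-3']
```

    [Keeping the pairs in their own table means neither the schedule records nor the budget records need to be duplicated when the relationship fans out in either direction.]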

    00:05:44:00 - 00:06:01:12
    Jie Ren
    So we talked about the data challenges and how you dealt with them. I can imagine that when the New York City government is running its day-to-day operations, there must be a lot of data being generated. How did you decide what data to prioritize and publish first?

    00:06:01:14 - 00:06:28:09
    Will Hisao
    So we started with the threshold because the scale of the data was too big. We started with a threshold of $25 million and more because it was a very labor-intensive process back in 2012. So the 25 million was the threshold. But later, when the City Council passed Local Law 37, there was no choice.

    00:06:28:09 - 00:06:40:19
    Will Hisao
    We just had to automate. There was a lot of automation to reduce the labor-intensive process and make the entire data set transparent.

    00:06:40:21 - 00:06:59:19
    Jie Ren
    Okay. There's a lot of work in it, for sure. You also mentioned the data challenges. I can imagine, again through my observation of and interaction with students analyzing and trying to understand the data, that the data is coming from different managing agencies, right? And there are a lot of stakeholders involved.

    00:07:00:01 - 00:07:08:18
    Jie Ren
    So when you and your people are building this dashboard, how do you ensure data quality, consistency, et cetera?

    00:07:09:00 - 00:07:31:20
    Will Hisao
    So I think the data set has two major parts. One is schedule, one is budget. In terms of budget, New York City has a great financial system. It's called FMS, the Financial Management System. It's a gold-standard accounting system. It has all the data we needed, so that part was less of a challenge.

    00:07:32:02 - 00:07:56:06
    Will Hisao
    So, the second part, which is schedule. Because of the data silo issue we just talked about, every agency had its own way of making its data. So we built up an ETL process to collect the data. For agencies that didn't have a system, we started with spreadsheets. So we built different customized pipelines for the agencies to collect them.

    00:07:56:06 - 00:08:18:12
    Will Hisao
    And of course, before that we had this standard data dictionary, and it took a lot of back and forth. So there are the two parts of the data, schedule and budget. We collect both of them and we join them. We use the so-called FMS ID, which is a code for each budget line, and join it with the agency schedule. That's how we put them together.

    00:08:18:14 - 00:08:19:09
    Will Hisao
    Yes.
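
    [Illustrative sketch, not part of the conversation. The join Will describes, budget lines keyed by FMS ID matched against agency schedule rows, can be tried out with Python's built-in sqlite3 module. The table names, column names, and values below are assumptions made up for this example and do not reflect the actual FMS or dashboard schema.]

```python
import sqlite3

# In-memory database with two tiny, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE budget (fms_id TEXT, amount REAL);
CREATE TABLE schedule (project TEXT, fms_id TEXT, phase TEXT);
INSERT INTO budget VALUES ('FMS-A', 10.0), ('FMS-B', 5.0);
INSERT INTO schedule VALUES
    ('Library roof', 'FMS-A', 'Design'),
    ('Library roof', 'FMS-B', 'Design'),
    ('Sewer repair', 'FMS-B', 'Construction');
""")

# Join schedule rows to budget lines on the shared FMS ID, then total per project.
# Note: a budget line shared by two projects is counted in full for each one here;
# this sketch does not attempt to apportion shared funding.
rows = conn.execute("""
    SELECT s.project, s.phase, SUM(b.amount) AS total_budget
    FROM schedule AS s
    JOIN budget AS b ON b.fms_id = s.fms_id
    GROUP BY s.project, s.phase
    ORDER BY s.project
""").fetchall()

for project, phase, total_budget in rows:
    print(project, phase, total_budget)
```

    [Because the relationship is many-to-many, any real aggregation would need a rule for splitting shared budget lines; that rule is not discussed in the episode.]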

    00:08:19:11 - 00:08:49:07
    Jie Ren
    Thank you, Will. My next question moves from the data part more towards the analysis part, because the data part is the starting point, even though there are a lot of challenges, et cetera. Now let's get to the analysis part. From your background and experience, how would you expect individual analysts, including, for example, Fordham students this semester, to generate insights that could help the government?

    00:08:49:07 - 00:08:53:02
    Jie Ren
    Have you seen any examples, and could you please share them?

    00:08:53:04 - 00:09:19:09
    Will Hisao
    Okay, so, since the data set launched in 2023, the students at Fordham University have provided what I call deep-dive R&D, research and development. While my team in the city government focuses on the daily operations of keeping the data flowing, the students had time to look at the data fresh, like, oh, why?

    00:09:19:09 - 00:09:25:18
    Will Hisao
    There are many whys, like, why is the data like this? They looked at it with a fresh view.

    00:09:25:20 - 00:09:30:11
    Jie Ren
    Yeah. I remember the engaging conversations in class about that. Yes, please carry on.

    00:09:30:13 - 00:10:02:22
    Will Hisao
    Yeah, yeah. So in a recent class, students analyzed our latest seven reporting periods, and they actually came up with some questions about how projects were delayed and the need for a good categorization. I think that was a good finding, that students in class had started to have these insights. And I think the students built a good model for us as we grow in the future, because right now we only have seven reporting periods.

    00:10:03:00 - 00:10:18:03
    Will Hisao
    I think this model will be very helpful in the future. As the data grows bigger, we'll be able to see more and more insights and patterns, and how this data is helping the city government operate.

    00:10:18:05 - 00:10:38:08
    Jie Ren
    Thanks, Will. So we talked about the Town+Gown collaboration, which is between the government and academia, for students to look at the data set. But this is such a wonderful data set, and it is accessible to the public. So anyone could be using their skill sets to analyze the data.

    00:10:38:10 - 00:10:45:03
    Jie Ren
    In your view, what skill sets should they have, technical and also analytical?

    00:10:45:05 - 00:11:09:17
    Will Hisao
    So technically, I think there's a hard skill, which is the required data tools, for example SQL, a SQL database, or Python. These are the most popular tools in the market. That's nonnegotiable, because you do need the tools to be able to handle millions of rows, yes, and manipulate them, sort them, filter them.

    00:11:09:23 - 00:11:35:07
    Will Hisao
    But also, thanks to AI, the technology today, the barrier to entry is a lot lower than before. Let's say someone has never used these data tools before. They can easily use AI to generate a script or a SQL statement to query or analyze these big data sets.

    00:11:35:07 - 00:11:51:12
    Will Hisao
    So, in terms of skills, I feel like today we have a very easy tool to start with, which is AI. But of course, the data tools, Python or SQL, are still mandatory. That's my opinion.

    00:11:51:16 - 00:12:14:04
    Jie Ren
    Totally agree, 100%. So maybe now let's move to the outlook and look ahead. Do you have any plans or vision for additional new features or new data sets that you could include in this public dashboard?

    00:12:14:06 - 00:12:23:09
    Will Hisao
    Okay. So currently, I would say my top wish would be contextual interactivity. For example, a map.

    00:12:23:11 - 00:12:24:21
    Jie Ren
    Yes, yes, interactively.

    00:12:24:21 - 00:12:55:10
    Will Hisao
    To say it the easy way: the map. Currently our data set doesn't really have a shapefile. It does have community district. It does have borough or citywide. With these categories, I'm hoping that in the future we'll be able to add a shapefile, which would allow the audience to easily navigate a map and see how it overlays with other data, for example crime rate or average income.

    00:12:55:12 - 00:13:11:11
    Will Hisao
    Yeah. Population. So that would make it easy to get a quick, more intuitive insight from the data set. I'm hoping that would be something we can work on in the future, which is the geospatial data.

    00:13:11:13 - 00:13:34:23
    Jie Ren
    Yeah. That's super helpful, because we know a picture is worth a thousand words, right? So definitely, the more visual the better. Also, when it comes to analysis, what students are trying to do is use additional data sets coming from other sources, for example to look at income and how to associate income with the data that you are making available. Now,

    00:13:35:01 - 00:13:54:04
    Jie Ren
    if you guys merged everything together, that would be amazing. So I'm very much looking forward to that effort. I have a final question for you. We know that this dashboard has been helping at least Fordham students to try to generate insights for government operations.

    00:13:54:04 - 00:14:08:22
    Jie Ren
    In your vision, how could this dashboard help generate insights and also strengthen partnerships like Town+Gown with other universities, for example?

    00:14:09:00 - 00:14:22:05
    Will Hisao
    Okay. I think it's all about feedback and the feedback loop. We started with collecting the data and analyzing the data. Now we're moving beyond that with students to create models that the city can adopt.

    00:14:22:08 - 00:14:23:05
    Jie Ren
    Yes.

    00:14:23:07 - 00:14:49:12
    Will Hisao
    So, using this dashboard not just to report what has happened, but as a measure of success of new policy, creating a tighter loop between academic theory and government practice. I would say, by using this data, starting with student analysis, we can tighten the gap and see what impact it has and how the data works.

    00:14:49:12 - 00:14:53:14
    Will Hisao
    And, you know, the feedback for the government operations.

    00:14:53:16 - 00:15:13:10
    Jie Ren
    Yes. We are very much looking forward to that. As an educator, I'm very proud when my students are presenting their solutions. And from our end, we truly want to know the feedback from the NYC government, to see whether we really did a good job at providing anything useful for the government.

    00:15:13:12 - 00:15:17:11
    Jie Ren
    So thank you so much Will for this conversation. I truly learned a lot.

    00:15:17:16 - 00:15:21:06
    Will Hisao
    Thank you, thank you, thank you for having me here. Thank you.

    00:15:21:07 - 00:15:40:09
    Jie Ren
    Hi. We are going to continue our conversation. Now I'm very happy to have three students of mine join me to talk about their experience interacting with the data made publicly available by the New York City government. So, Faris, why don't we start with you? Please introduce yourself.

    00:15:40:10 - 00:16:07:16
    Faris Al-Dhahi
    Thank you for having me. My name is Faris Al-Dhahi. I am a first-semester MSBA student here at Fordham, and I'm also the president of the Fordham Business Analytics Society. I have a background in international relations, which I studied as an undergrad at the London School of Economics, and now I have transitioned to business analytics. I'm very happy to be here in New York, the heart of the financial sector of the world.

    00:16:07:18 - 00:16:21:19
    Faris Al-Dhahi
    And, yeah, I'm now a part of this project dealing with the New York City Project Tracker data. It's been a great experience. I've spent three months working on it and learned a lot. And I'm happy to be here.

    00:16:21:21 - 00:16:48:15
    Jie Ren
    Oh, that's wonderful. I'm super impressed by Faris's performance in class and also outside the classroom, for sure. And he just joined us, so he is brand new in terms of data analysis. I know that you have some econ background, et cetera. Okay, so let's focus on this project. What was your biggest experience coming out of this project?

    00:16:48:15 - 00:16:51:17
    Jie Ren
    Would you please share, if you had to highlight one or two things?

    00:16:51:19 - 00:17:07:03
    Faris Al-Dhahi
    One or two things. So coming into the project, I didn't really know what to expect, but when we got the data set, it was a huge data set. And also, Terri mentioned that this was the first time people were analyzing this data.

    00:17:07:03 - 00:17:07:19
    Jie Ren
    Very exciting.

    00:17:07:19 - 00:17:37:23
    Faris Al-Dhahi
    And it showed, because we spent a long time cleaning the data. I'm still cleaning the data now, three months into the project. And, yeah, I learned a lot about having data organized. I think moving forward, when dealing with these kinds of large data sets, it's very important to have the data organized and structured, because having missing data, or data that doesn't add up or match the other data sets, is very challenging.

    00:17:37:23 - 00:17:43:14
    Faris Al-Dhahi
    And, I think that's probably the main, main thing I'm going to be taking forward with me.

    00:17:43:14 - 00:18:04:06
    Jie Ren
    100%. It's not only specific to this context, but applies to all contexts of data analysis, right? Even with AI model training, there's always the question of data quality. I cannot stress enough the importance of data quality and how to organize data, et cetera. That part definitely takes a lot of time.

    00:18:04:08 - 00:18:27:02
    Jie Ren
    So let's narrow it down, because, you know, the data has two different parts. One is for infrastructure, the other one is for public buildings. And I know that you and your team members are in charge of the data related to public buildings, right? We also had the honor of visiting one public building construction site.

    00:18:27:03 - 00:18:42:02
    Jie Ren
    I am truly grateful to the New York City government for arranging that for us. So could you please share your observations and experiences from this construction site visit?

    00:18:42:02 - 00:19:02:06
    Faris Al-Dhahi
    So for me, actually, the visit to the construction site was probably one of the most important things from the project, because when we're in class analyzing this data, we're only looking at a bunch of numbers, and it doesn't really tell us much. But when you go there, you can ask questions to the project managers, which we did, he explained.

    00:19:02:06 - 00:19:23:19
    Faris Al-Dhahi
    For example, the reason for the delay on that project was a waterproof generator that wasn't part of the scope. That caused a delay, but it wasn't recorded in the data. So it really gave us a better understanding of the scheduling as well, because there we could see what stage of the schedule they're in and how much work goes into that schedule.

    00:19:23:19 - 00:19:37:07
    Faris Al-Dhahi
    So, I think the visit really helped us also now with creating the variables, it was a big factor of why we could pick the variables we have in the questions we're doing now for the presentation.

    00:19:37:09 - 00:20:03:14
    Jie Ren
    Oh, that's amazing. Definitely, we need that domain knowledge about the government's operations. Okay, I have two more questions for you. The next question is this: we know that students are trying to understand the domain knowledge and the context in order to come up with questions to address.

    00:20:03:20 - 00:20:07:06
    Jie Ren
    And then could you please elaborate on that whole process of ideation?

    00:20:07:08 - 00:20:34:05
    Faris Al-Dhahi
    Yeah. So when we started looking at what questions to ask and to analyze, I was looking at it as a story. I wanted to look at the project, that it's a story, and I'm explaining a story to the government and the New York City mayor's office. And basically also thinking in their perspective, what kind of variables and insight do they need moving forward?

    00:20:34:06 - 00:20:59:18
    Faris Al-Dhahi
    So, yeah, looking into the data, we broke it up into three or four different categories. There was the textual data, which is all the project-related data: the description and the agency related to each project. The second category was all the dates, so the scheduling. And we also had a category for the money, which is all the budgeting and the variance.

    00:20:59:18 - 00:21:15:11
    Faris Al-Dhahi
    And I think that really helped us break down what questions we wanted to analyze. And yeah, we were able to come up with roughly 30 questions to ask. I think we're going to have a good presentation.

    00:21:15:11 - 00:21:35:14
    Jie Ren
    And I'm also looking forward to your presentation. I'm sure it's going to be wonderful. All right, the last question for you. You have been working with the data, trying to understand the government's operations, for almost one semester. What have you learned? What's your understanding of the government's operations, and what insights could this work provide for them?

    00:21:35:16 - 00:22:00:07
    Faris Al-Dhahi
    Yeah. So going in I didn't know it was so complex. The amount of agencies that are involved and also the communication that's needed between projects because projects are divided between managing and sponsoring agency. And a lot of the time these aren't the same agency. Yeah. So it was much more complex than I imagined. And a lot of different variables go into it.

    00:22:00:07 - 00:22:05:06
    Faris Al-Dhahi
    And I can imagine Will talked about it, because he's the one who put the data together.

    00:22:05:07 - 00:22:07:09
    Jie Ren
    100% he’s our data person.

    00:22:07:13 - 00:22:24:16
    Faris Al-Dhahi
    He had a lot of work to put all this data together. And he also has to go back to these agencies to talk about missing data, for example. And, yeah, it's much more complex than I imagined starting off.

    00:22:24:18 - 00:22:26:17
    Jie Ren
    Yeah. I'm sure that you learned a lot this semester.

    00:22:26:17 - 00:22:54:05
    Faris Al-Dhahi
    I learned a lot. And I think this was very valuable. I was actually talking to some of my friends, and I said, the classes are, of course, amazing, and the professors teach me a lot, but I feel like being hands-on and dealing with this project as if it were an internship really gives you a different perspective on how you need to act and what is expected from you.

    00:22:54:07 - 00:23:09:19
    Faris Al-Dhahi
    One thing I think that also I forgot to mention that moving forward is going to be really important for me is meeting deadlines and listening to all the instructions that are given to you. Yeah, because now this was a much more like real world experience.

    00:23:09:19 - 00:23:10:15
    Jie Ren
    Yes, exactly.

    00:23:10:15 - 00:23:15:14
    Faris Al-Dhahi
    So yeah, it's definitely helped me a lot for when I start my work.

    00:23:15:18 - 00:23:38:04
    Jie Ren
    Oh, wonderful. I actually asked Will this question: what type of skill sets, in his view, are essential to analysts analyzing any data, including the publicly available data? He focused very much on the technical part. But I totally agree with you that social skills do matter, especially in a real-world context.

    00:23:38:04 - 00:23:40:06
    Jie Ren
    [Faris Al-Dhahi: Yeah, exactly.] Thank you so much, Faris.

    00:23:40:07 - 00:23:43:07
    Faris Al-Dhahi
    Thank you for having me. Thank you, thank you.

    00:23:43:09 - 00:24:04:20
    Jie Ren
    All right, let's continue the conversation about analyzing data for this special type of construction, which is public buildings. Here I have another student of mine that I'm so proud of. He's Kamrul, from the second year of the MSBA program. So Kamrul, please introduce yourself.

    00:24:04:22 - 00:24:17:06
    Kamrul Islam
    My name is Kamrul Islam. I'm pursuing an M.S. in business analytics. This is my final year. I will be graduating at the end of this year.

    00:24:17:08 - 00:24:24:23
    Jie Ren
    Will be missing you for sure. We will be, like, missing you. Since like you are graduating. [Inaudible]. I hope that you'll be around, right?

    00:24:24:23 - 00:24:32:09
    Kamrul Islam
    Yeah. I will also miss the program, my faculty, and the campus as well.

    00:24:32:15 - 00:24:52:21
    Jie Ren
    Yeah. Yes. Okay. So, for some time you have been helping me run the MSBA program. I'm truly very grateful for that. I just talked to Faris, who is a first-year student, and I know that you and Faris are on the same team.

    00:24:52:21 - 00:25:14:22
    Jie Ren
    You are on the team in charge of the data related to public buildings, right? So I truly want to hear your insights, given that you are a second-year student. In terms of technical stuff, you are more experienced. So what technical and analytical skills did you and your team use to tackle the data set, which can be quite complex?

    00:25:15:00 - 00:25:42:02
    Kamrul Islam
    Okay. So for the NYC Capital Project data, there were four data sets. Initially, when I was discussing it with Faris, we were facing a problem merging the four data sets into one combined data set. So, initially I used SQL to find one variable to use as an identifier.

    00:25:42:04 - 00:26:22:18
    Kamrul Islam
    [Inaudible] key. So then, once we selected that, we looked at how many variables were left. One by one, we classified each data set and then separated out the unique key identifier. And then we put it in Excel, using Excel and simultaneously Python as well. So we loaded all four data sets in Python, using a lot of Python libraries like NumPy and pandas for the data cleaning and data processing.

    00:26:22:19 - 00:26:50:17
    Kamrul Islam
    And then, once the data sets were merged, I did visualizations to share with Terri, and Terri gave feedback on whether we had merged the data appropriately or not. She gave feedback that some of the data had no FMS ID, the key identifier, and other IDs as well.

    00:26:50:17 - 00:26:52:05
    Kamrul Islam
    Yes, they were missing.

    00:26:52:06 - 00:26:56:11
    Jie Ren
    Yeah. And then some data challenges. Yeah. Yeah 100%. But for any project.

    00:26:56:15 - 00:27:20:14
    Kamrul Islam
    Right. Of course. So then, okay, this doesn't make sense, because every data set has to have a primary key identifier, like FMS ID and Fi ID. So we started working on that. And finally, more than 80 or 98% were accurate and covered the FMS ID and Fi IDs. Then there were the rest of the rows.

    00:27:20:14 - 00:27:30:17
    Kamrul Islam
    She said to put those in a separate data sheet, so we separated them. And Faris also helped, and other team members like [inaudible]. They also helped.
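
    [Illustrative sketch, not part of the conversation. The step Kamrul describes, setting aside rows that lack a key identifier before merging, can be sketched in plain Python. The team used pandas; this standard-library version is a simplified stand-in. The function names, the fms_id column name, and the sample rows are all hypothetical.]

```python
def split_by_key(rows, key="fms_id"):
    """Separate rows that have a usable key from rows where it is missing or blank."""
    keyed, unkeyed = [], []
    for row in rows:
        value = row.get(key)
        if value is None or str(value).strip() == "":
            unkeyed.append(row)
        else:
            keyed.append(row)
    return keyed, unkeyed


def merge_on_key(left, right, key="fms_id"):
    """Inner-join two lists of dicts on key (right side assumed unique per key)."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Made-up sample data: two schedule rows have no usable identifier.
schedule = [
    {"fms_id": "FMS-A", "phase": "Design"},
    {"fms_id": "", "phase": "Construction"},
    {"fms_id": None, "phase": "Closeout"},
]
budget = [{"fms_id": "FMS-A", "amount": 10.0}]

keyed, unkeyed = split_by_key(schedule)
merged = merge_on_key(keyed, budget)

print(merged)        # rows that could be matched on the key
print(len(unkeyed))  # rows set aside in a separate sheet for follow-up
```

    [Keeping the unmatched rows in their own list, rather than dropping them, mirrors the separate data sheet mentioned above and leaves them available for later review.]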

    00:27:30:19 - 00:27:31:19
    Jie Ren
    An entire team.

    00:27:31:21 - 00:27:35:21
    Kamrul Islam
    [Inaudible] I, I appreciated their energies.

    00:27:36:02 - 00:27:37:02
    Jie Ren
    Yeah, exactly.

    00:27:37:04 - 00:27:43:16
    Kamrul Islam
    So then the next challenge was very complicated: classifying projects as public buildings or infrastructure.

    00:27:43:22 - 00:27:45:04
    Jie Ren
    Yeah. The typology. Yeah.

    00:27:45:04 - 00:27:46:03
    Kamrul Islam
    The typology for.

    00:27:46:03 - 00:27:46:22
    Jie Ren
    Classification.

    00:27:46:22 - 00:27:50:01
    Kamrul Islam
    Yeah, yeah. So like.

    00:27:50:03 - 00:27:52:04
    Jie Ren
    That needs domain knowledge, by the way. Yeah.

    00:27:52:05 - 00:27:52:18
    Kamrul Islam
    Right.

    00:27:52:19 - 00:27:53:14
    Jie Ren
    We have the lot.

    00:27:53:20 - 00:28:22:13
    Kamrul Islam
    Right. Domain knowledge. Yeah. So Terri gave us a classification guide. I forgot the name of the document. But it said, like, if it's managed by DDC, the guess is it's going to be public buildings, not infrastructure. So it was written in the docs.

    00:28:22:15 - 00:28:41:15
    Kamrul Islam
    So she gave us two weeks, and I thought, okay, how can I do that? Two weeks? That's a lot of time. It doesn't make sense, since we know machine learning, we know this NLP stuff. Why are we wasting time?

    00:28:41:18 - 00:29:02:14
    Kamrul Islam
    And then I did feature extraction. I did a lot of feature extraction. On one side is the managing agency, on the other side is the sponsor agency. If this kind of managing agency is paired with this kind of sponsor agency, then it's going to be a public building or infrastructure.

    00:29:02:14 - 00:29:36:20
    Kamrul Islam
    Yeah. If not, we looked at something related to the project description, what the project description says. So initially we put it in Python and went through all of the coding stuff one by one. From the descriptions, we applied NLP stuff, all of these NLP algorithms, and then classified public buildings and infrastructure, with more than 40 or 60 features for public buildings, 35 or 36.

    00:29:36:23 - 00:29:49:03
    Kamrul Islam
    For infrastructure. And then the code again is just like a command: if a project follows this kind of feature, then it's going to be a public building; if it follows that kind, okay.

    00:29:49:03 - 00:29:50:06
    Jie Ren
    Yeah. So the classification.

    00:29:50:06 - 00:29:54:03
    Kamrul Islam
    Yeah. So we mostly focused on the classification. Yeah.
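
    [Illustrative sketch, not part of the conversation. Kamrul describes extracting features from the agencies and project descriptions and then applying if-then rules to label each project. The sketch below is a much simpler stand-in that counts keyword matches in a description; it is not the team's actual feature set or NLP pipeline, and the keyword lists and the classify function are hypothetical.]

```python
# Hypothetical keyword lists; the team's real features are not given in the episode.
PUBLIC_BUILDING_TERMS = ("library", "school", "firehouse", "precinct")
INFRASTRUCTURE_TERMS = ("sewer", "water main", "bridge", "roadway")


def classify(description):
    """Label a project description by whichever keyword list matches more terms."""
    text = description.lower()
    building_hits = sum(term in text for term in PUBLIC_BUILDING_TERMS)
    infra_hits = sum(term in text for term in INFRASTRUCTURE_TERMS)
    if building_hits > infra_hits:
        return "public building"
    if infra_hits > building_hits:
        return "infrastructure"
    return "unclassified"  # ties and no-match cases are left for manual review


for text in (
    "Library roof replacement",
    "Sewer and water main reconstruction",
    "Citywide planning study",
):
    print(text, "->", classify(text))
```

    [Returning "unclassified" for ambiguous descriptions keeps the rules conservative, so someone with domain knowledge can review the leftover cases by hand.]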

    00:29:54:05 - 00:29:57:02
    Jie Ren
    And how long did that take? It didn't take two weeks?

    00:29:57:04 - 00:29:58:16
    Kamrul Islam
    Not two weeks, like three days.

    00:29:58:16 - 00:30:35:11
    Jie Ren
    Yeah. Right. So that's exactly the beauty of machine learning: it can definitely shorten the entire process, and I'm very proud to see that my students are able to pull this off. All right, so we talked about the technical stuff. Now, thinking again about the entire process, do you think that the data itself could inspire you and your team to come up with questions? Because you have the different columns of the data, and they represent different entities.

    00:30:35:13 - 00:30:42:20
    Jie Ren
    Do you think that part could, like, inspire you in terms of your ideation? Like how did you come up with these questions?

    00:30:42:22 - 00:31:07:03
    Kamrul Islam
    Okay, so actually this data is very interesting, but, like, I always try to make my task easier or more flexible. Like, before I do anything else, I think for more than five minutes, like brainstorming. Okay.

    00:31:07:05 - 00:31:12:13
    Jie Ren
    That's the roadmap. I'll have to think. Yeah. Further, in order to make sure the plan is okay.

    00:31:12:15 - 00:31:39:15
    Kamrul Islam
    [Inaudible]. Right. So, like, here is my starting point. Here is my output. So which way do we have to go? And, you know, the reinforcement model: it's like a cycle, you go back to the starting point, make a solution, and go again. Yeah. So we applied that kind of formula. So in the data set we faced problems, like after the classification, especially the [Inaudible].

    00:31:39:15 - 00:32:09:10
    Kamrul Islam
    Yes. Yeah. So, like, there are so many missing values, especially in the construction and design part. And also the forecast completion and also the actual start date. So there are variations in the dates. So I asked Faris, do you know how to figure this out as a team? And also, yeah, Amogha figured out some things.

    00:32:09:10 - 00:32:34:20
    Kamrul Islam
    And once she got some of the parts to analyze and identified the new variables, she got back to me. And then Faris also. So we combined our work and always tried to keep in touch with each other, made a group chat and everything. And then we finally did it. Yeah, that's a great experience. Yeah. And I love that.

    00:32:34:22 - 00:32:41:02
    Kamrul Islam
    Really interesting. So many variables. So, I mean, more than 100K data.

    00:32:41:04 - 00:32:43:17
    Jie Ren
    Wow. Yeah. We know like it's very complex.

    00:32:43:17 - 00:32:45:18
    Kamrul Islam
    37 variables. So it's too much.

    00:32:45:18 - 00:33:05:08
    Jie Ren
    Yeah. It's a real-life context. Yeah. That's very true. Yeah. Well, thank you so much, Kamrul, for the wonderful insights, about the technical stuff and also your understanding and your ability to use this data set to generate insights. I'm so looking forward to your final presentation at City Hall next week. So I'm so proud.

    00:33:05:14 - 00:33:06:09
    Jie Ren
    Thank you so much.

    00:33:06:11 - 00:33:09:13
    Kamrul Islam
    Thank you, professor.

    00:33:09:15 - 00:33:30:16
    Jie Ren
    Hi. We will continue our conversation, but this time I want to talk about the other type of construction, which is infrastructure. And I have a student of mine who is also in the first year of the MSBA program, Amogha, who is part of the team in charge of the project related to this type of data.

    00:33:30:20 - 00:33:55:02
    Amogha Machinahalli Srikanta
    [Jie Ren: So, Amogha, why don't you introduce yourself?] Thank you for having me, Professor. My name is Amogha and I'm from India. I'm a graduate student at Fordham for the master's in business analytics. And my background is computer science engineering. I did that in undergrad. But I wanted to explore how that really works on real data sets, like business.

    00:33:55:02 - 00:34:19:07
    Amogha Machinahalli Srikanta
    So now we're working on public sector data sets, so that's really nice. And about me, I've really been eager to learn about how data can really drive business decisions. So I think that's something that I really like. And also, at Fordham there have been really good opportunities, this being one, and I'm also representing the Fordham Business Analytics Society.

    00:34:19:09 - 00:34:44:11
    Jie Ren
    [Amogha Machinahalli Srikanta: So I'm really excited to be here today.] It's so nice to have you today. So I truly want to get your insights in terms of your understanding of the data representing infrastructure. Faris and also Kamrul talked about the other type of construction, which is public buildings. And you guys have been, for the whole semester, right, trying to understand the infrastructure in New York City.

    00:34:44:16 - 00:35:19:13
    Jie Ren
    Could you please start with that? Like, what the data is like regarding the infrastructure. And we know that there was an on-site visit, right, to the construction site related to infrastructure. Could you please talk about both? [Amogha: Okay. Yes. So firstly, in the infrastructure data set, most of the data columns are similar, but figuring out what they mean is a little different from public buildings, because with a public building, you can see it: there's a horizontal base, and then it just goes up from it.]

    00:35:19:15 - 00:35:39:03
    Amogha Machinahalli Srikanta
    Yeah, but in infrastructure, most of it is underground, or you can't see it. And there are people walking around and daily activities going on over there. So it can get really difficult to do that, I guess. Yeah. But the infrastructure projects have been really good, and the exposure is really good.

    00:35:39:05 - 00:36:04:12
    Amogha Machinahalli Srikanta
    In this data set, we also tried to see if there were any correlations between, like, the area and the budget, and we found that there was nothing like that, because each infrastructure project is different. For example, when we went on the site visit, we saw that they were running wires underground, because in Manhattan there are really high-rise buildings.

    00:36:04:14 - 00:36:20:13
    Amogha Machinahalli Srikanta
    So you can't really run it overhead. So they were trying to put it underground, and they had to consider a lot of things: firstly, how, you know, there'll be daily transportation going on over there, and then the people, and then the telephone agencies.

    00:36:20:13 - 00:36:45:13
    Amogha Machinahalli Srikanta
    There were a lot of things to consider. So I think it's really different from public buildings. Having said that, I wouldn't say that this is difficult or that is easy, but yeah. [Jie Ren: They all have their uniqueness, that's for sure.] So, yeah, the site visit, I think, was really good because we went there and they showed us the site first, and then we were like, okay, we were working on this data and now I see it right here.

    00:36:45:15 - 00:37:01:21
    Amogha Machinahalli Srikanta
    And that was really fascinating. And we also had some questions. So we were like, do you think that this is something that can be a reason to predict delay? Like, okay, so we have a predictive analysis. We want to predict how, or if, a project will be delayed or not.

    00:37:01:23 - 00:37:23:03
    Amogha Machinahalli Srikanta
    So we could ask them, like, is it any of these factors: maybe the agency is not responding, or something has changed with the contractors, or there are weather conditions or environmental things that we have to take into account, because everything is going underground and you can't really see it. So, yeah. [Jie Ren: So I'm so glad that you mentioned that you guys are working on analysis, right?]

    00:37:23:03 - 00:37:47:00
    Jie Ren
    That has two parts. One is descriptive, the other one is predictive. Right. And you also mentioned the predictive analysis that you do. So could you please elaborate on that, on how you build these forecasting models? [Amogha: Okay. So for descriptive, firstly we take the count of each and everything so that we have the base data set up. For example, we can see the unique projects, which come from the unique identifier.]

    00:37:47:00 - 00:38:06:00
    Amogha Machinahalli Srikanta
    Like, we take the FMS ID, we take the PID, and then we take the reporting period, and then we get each unique project. So we get things like that in the descriptive analysis. And the typologies, because we recently added typology as a new column in the data set, which was really nice. With [Inaudible].
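
The unique-project step described here can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the field names (`fms_id`, `pid`, `reporting_period`, `budget`) and the sample rows are hypothetical, and it assumes that FMS ID plus PID identify a project while the reporting period distinguishes repeated snapshots of the same project.

```python
# Hypothetical sample rows; field names are assumptions, not the real schema.
rows = [
    {"fms_id": "A1", "pid": "001", "reporting_period": "2023-01", "budget": 100},
    {"fms_id": "A1", "pid": "001", "reporting_period": "2023-06", "budget": 120},
    {"fms_id": "B2", "pid": "007", "reporting_period": "2023-06", "budget": 50},
]


def latest_snapshot_per_project(rows):
    """Keep one row per (fms_id, pid): the one with the latest reporting period."""
    latest = {}
    for row in rows:
        key = (row["fms_id"], row["pid"])
        current = latest.get(key)
        # "YYYY-MM" strings sort chronologically, so plain comparison works.
        if current is None or row["reporting_period"] > current["reporting_period"]:
            latest[key] = row
    return latest


projects = latest_snapshot_per_project(rows)
print(len(projects), "unique projects")
```

Counting the keys of `projects` gives the number of unique projects, and each value is the most recent snapshot available for that project.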

    00:38:06:02 - 00:38:33:10
    Amogha Machinahalli Srikanta
    [Jie Ren: Yes.] Well, super. They helped a lot. So in that, Terri told us which particular projects would go to which particular typology. So I performed feature engineering on the data set using Python, and we searched for keywords, and it was like, if there was anything which had street or road, it would become a roadwork project.

    00:38:33:12 - 00:38:53:23
    Amogha Machinahalli Srikanta
    And if there was something else, like if there was a park, that would become Parkland Infra, because most of the infrastructure projects are park projects. So we had many rules like this. All of this was part of the descriptive analysis. Coming to the predictive analysis, we want to predict if a particular project is at risk of falling into delay.
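
The keyword rules for the typology column can be sketched as follows. Only the two rules mentioned in the conversation (street or road, and park) are encoded; the full mapping is not given in the transcript, so the label strings, the `Other` fallback, and the function name `assign_typology` are assumptions made for illustration.

```python
import re

# Ordered (typology, pattern) rules; the first match wins.
# Whole-word matching keeps "parking" from being read as "park".
RULES = [
    ("Roadwork", re.compile(r"\b(street|road)s?\b", re.IGNORECASE)),
    ("Parkland Infra", re.compile(r"\bparks?\b", re.IGNORECASE)),
]


def assign_typology(description, default="Other"):
    """Return the first typology whose keyword appears in the description."""
    for typology, pattern in RULES:
        if pattern.search(description):
            return typology
    return default


for text in [
    "Reconstruction of Main Street",
    "Riverside Park playground upgrade",
    "Parking garage repair",
]:
    print(text, "->", assign_typology(text))
```

Rule order matters when a description mentions both a road and a park, so a real rule set would need an agreed priority among typologies.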

    00:38:54:01 - 00:39:20:06
    Amogha Machinahalli Srikanta
    So we take many factors into consideration. The first thing is looking at its past history. Having said that, we have limited reporting periods, but I don't think that was a constraint. I like that we worked with what we had, so we checked the delay with that. And then there was also which top features are influencing that delay. Like, for example, is it the budget?

    00:39:20:06 - 00:39:39:02
    Amogha Machinahalli Srikanta
    Is it the area? What is it? Is it the budget? Is it the density? So that was really nice. And that is the predictive analysis that we worked on. [Jie Ren: Yeah. I'm so looking forward to your presentation at City Hall next week about that. So we talked about the data analysis, right, the predictive model, and also the data complexity.]

    00:39:39:02 - 00:40:06:06
    Jie Ren
    Right. And you, and also Faris, and also Kamrul have agreed on this, right? And so now, the questions: students are trying to understand the background and the context of the operations in order to ask questions. So how did that process go for you? [Amogha: I think firstly, when the data set came to us, it was very raw and we had to do a lot of pre-processing on it.]

    00:40:06:07 - 00:40:31:00
    Amogha Machinahalli Srikanta
    And I would like to say that we learned more about the data from that pre-processing than from doing the analysis, because by the time we came to the analysis, we had almost understood what was happening in the data. [Jie Ren: Yes. Not easy.] Yes. It was like that. And I think how we came up with the questions was, our first basic question when we saw the data was, okay, what can be our unique identifier, so that we can count how many unique projects there are?

    00:40:31:02 - 00:40:50:09
    Amogha Machinahalli Srikanta
    And then moving on, we went more to, like, okay, how many sponsor agencies are there, and which sponsor agencies go for which type of projects and which boroughs. So we went deeper after each review of the data set. And then we started cleaning the data set simultaneously. And after that we went into the predictive analysis.

    00:40:50:09 - 00:41:09:21
    Amogha Machinahalli Srikanta
    Descriptive analysis, everything. The questions that we came up with, I think they just came out naturally. We just kept asking Terri and Will every week, like, okay, how is this coming about? Is borough related to the delay, or is it related to the payment? Like, how is it happening? The budget variance, where does that come from?

    00:41:09:23 - 00:41:28:00
    Amogha Machinahalli Srikanta
    And then we tried to make many more columns. We created new variables. Like, one was maybe the total spend, but we found the percentage of it, or things like that. [Jie Ren: So, thank you so much. This was a wonderful, wonderful sharing with all the audience. Thank you.]
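
A derived variable of the kind mentioned here, total spend expressed as a percentage, might look like the sketch below. The transcript does not say what the percentage is taken of, so using the project budget as the denominator is an assumption, as are the function and argument names.

```python
def percent_spent(total_spend, budget):
    """Total spend as a percentage of budget, or None when it cannot be computed."""
    # Missing spend, or a missing or zero budget, would make the ratio meaningless.
    if total_spend is None or not budget:
        return None
    return 100.0 * total_spend / budget


print(percent_spent(50, 200))
print(percent_spent(10, 0))
```

Returning `None` instead of raising keeps rows with missing values in the data set, which matters when many fields are incomplete.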

    00:41:28:01 - 00:41:30:00
    Amogha Machinahalli Srikanta
    Thank you so much, Professor. [Jie Ren: Absolutely.]
