Doing Data Science for Asylum Analysis at Human Rights First
Not to get too political, but immigration has become a prominent topic in current political discourse. Immigration reform is debated from two opposing perspectives: one favors restricting immigration, and the other favors expanding it. Our project tries its best to take a purely factual approach to the topic.
HRF needs a web tool backed by data science to aggregate data on asylum cases, allow users to explore that data, and predict and visualize how a judge might rule on a specific asylum case — as well as what specific elements of an asylum case seem to most impact a favorable or unfavorable ruling. This will take multiple Labs cohorts to complete.
For this initial round, we’ll focus on scraping aggregated PDFs as well as individual PDFs of case files to extract plaintext and the name of the judge who made the ruling.
Planning Phase (Trello and diagrams.net)
Our team learned the importance of planning with schemas for the user flow, the databases, and the architecture of the future web tool. We used diagrams.net for the diagrams and Trello for tracking tasks, which let us divide up the formidable workload more efficiently. On the data science team, we discussed the differences between Type A and Type B data science and figured out how to communicate best with each other. I believe the article above from our stakeholder Robert Chang gives the best explanation of these two concepts.
I like to break every user story into small, step-by-step tasks. For instance, I broke the user story about the administrator uploading files to our server into ten smaller checklist tasks, divided by whether each one is assigned to DS or Web. The user story is well thought through and organized, so I believe it would take less than a week to complete; giving the team around three days would be reasonable.
Technical Planning (Databases)
We chose PostgreSQL on AWS RDS as the database. It is arguably one of the most reliable and efficient databases available. We need it instead of a NoSQL database because of the many relationships in the data. For example, the bookmarked cases table is related to the users table, and each judge in the judges table can be related to multiple cases in the asylum cases table. The main risk with our database is a data breach: all of these cases are supposed to be private, so security is a challenge we might face. Another open challenge is figuring out how to scrape a judge's name from the signature on a PDF.
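To make the relationships described above concrete, here is a minimal sketch of how such a schema could look. The table and column names are hypothetical, not our actual schema, and Python's built-in sqlite3 stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical table and column names; sqlite3 stands in for PostgreSQL here.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE judges (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One judge can rule on many cases (one-to-many).
CREATE TABLE cases (
    id INTEGER PRIMARY KEY,
    judge_id INTEGER NOT NULL REFERENCES judges(id),
    outcome TEXT
);
-- A bookmark links one user to one case (many-to-many overall).
CREATE TABLE bookmarked_cases (
    user_id INTEGER NOT NULL REFERENCES users(id),
    case_id INTEGER NOT NULL REFERENCES cases(id),
    PRIMARY KEY (user_id, case_id)
);
"""


def build_demo_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def cases_per_judge(conn):
    """Return {judge name: number of cases}, including judges with no cases."""
    rows = conn.execute(
        "SELECT j.name, COUNT(c.id) FROM judges j "
        "LEFT JOIN cases c ON c.judge_id = j.id "
        "GROUP BY j.id ORDER BY j.name"
    ).fetchall()
    return dict(rows)
```

Queries like the join in cases_per_judge are the kind of thing a relational database handles naturally, which is the main reason we preferred it over a NoSQL store.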
After talking to our web team, I made an executive decision to store the processed PDFs in AWS S3. We built a FastAPI endpoint that accepts a POST request from the web team containing a PDF file; the file is automatically scraped, and the extracted information is added to our tables. Once the PDF reaches our DS API, boto3 (the AWS SDK for Python) pushes it into our private S3 bucket. A simple script then deletes the local file to save space.
I was the representative for our group this week, and my main focus was making sure we were well organized. I worked on the framework for our PDF storage in AWS S3 buckets while my team members worked on scraping each PDF for relevant data. As the representative, I want to hear in our meetings what everyone is working on and whether any blockers are affecting the completion timeline. For our web team, that means making sure they get all the data and information they ask for.
Working on a team takes a lot of communication and planning so that we stay efficient and don't duplicate each other's work. That means splitting each task into steps that each of us can pick up and connect to solve the task at hand. It means not slowing down the web team and promptly providing any information whose absence would delay their tasks. It means reviewing recent pull requests and approving them with constructive feedback. And it means not talking until the other person has finished, even when Zoom cuts out.
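The upload-then-delete step can be sketched roughly as follows. This is not our production code: the function name, the bucket name, and the use of a temporary directory are assumptions for illustration. The S3 client is passed in as a parameter, so the sketch itself needs only the standard library; in the real service it would be the object returned by boto3.client("s3"), and the function would be called from the FastAPI POST handler.

```python
import os
import tempfile


def store_pdf_and_clean_up(pdf_bytes, filename, s3_client, bucket):
    """Write the upload to disk, push it to S3, then delete the local copy.

    `s3_client` is assumed to behave like boto3.client("s3"), whose
    upload_file method takes (Filename, Bucket, Key).
    """
    # basename() guards against path components in a user-supplied filename.
    key = os.path.basename(filename)
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, key)
    with open(local_path, "wb") as fh:
        fh.write(pdf_bytes)
    try:
        s3_client.upload_file(local_path, bucket, key)
    finally:
        # Remove the local file even if the upload fails, to save disk space.
        os.remove(local_path)
        os.rmdir(tmp_dir)
    return key
```

Putting the cleanup in a finally block means a failed upload does not leave stray PDFs on the API server, which matters because the case files are supposed to stay private.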
Technical Process — Natural Language Processing and Selenium/Scrapy
To find the judge's name, we first needed to convert the PDFs into plain text using pdfminer.six. This worked out well and did most of the work we needed for that formidable task. Using spaCy's en_core_web_sm model, we could load a natural language processing pipeline that performed named entity recognition on the newly converted text.
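A rough sketch of that two-step process is below. The function names are hypothetical, and this only collects PERSON entities as candidates; deciding which candidate is actually the judge is a separate step not shown here. It assumes pdfminer.six and spaCy with the en_core_web_sm model are installed.

```python
def person_names(text, nlp):
    """Return unique PERSON entity strings in order of first appearance.

    `nlp` is any callable that returns an object with an `ents` sequence,
    such as the pipeline returned by spacy.load("en_core_web_sm").
    """
    seen = []
    for ent in nlp(text).ents:
        # Collapse whitespace, since PDF extraction often breaks names across lines.
        name = " ".join(ent.text.split())
        if ent.label_ == "PERSON" and name not in seen:
            seen.append(name)
    return seen


def person_names_from_pdf(pdf_path):
    """Convert a PDF to text with pdfminer.six, then run spaCy NER on it."""
    # Third-party imports are kept local so person_names() stays dependency-free.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    import spacy  # python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")
    return person_names(extract_text(pdf_path), nlp)
```

Calling person_names_from_pdf("some_case.pdf") on a real case file would return every name the model tagged as a person, so the result will usually include applicants and attorneys as well as the judge.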
Technical Challenges
The case file upload is responsible for all the data on the web end and populates the tables of cases and judges. An upload might contain a compilation of links, which saves the user from uploading each case PDF one at a time. Selenium grabs each PDF, downloads it, and adds it to the AWS S3 bucket that stores our PDFs. The user can view the judge's ruling on the PDF for their case. Each case and each judge has useful data associated with it. The data science team produces this data by retrieving each uploaded file from the S3 bucket and scraping it for relevant keywords using spaCy. Each judge's name is run through a script that scrapes the web for relevant data, such as how the judge got the position, which presidential administration appointed the judge, which department the judge works for, and which county the judge's court is in.
I believe we got a great start on this project, including setting up the outline for completing the MVP in a good timeframe. The main challenge is that the resources we have been provided have anti-automation measures built into them. I anticipated this challenge, but I did not anticipate having to come up with alternative processes to offer the stakeholders for adding data, for example uploading each PDF individually instead of an index of PDFs. The problem is that each link in the index PDF leads to a website, not to a PDF.
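Before handing a compilation of links to a downloader, it can help to separate the links that point straight at a PDF from the ones that lead to an ordinary web page. The sketch below is a hypothetical helper, not part of our pipeline, and it only looks at the URL path, so a link that serves a PDF without a .pdf extension would be classified as a page.

```python
from urllib.parse import urlparse


def split_links(urls):
    """Split URLs into (direct PDF links, other pages) based on the path suffix."""
    pdf_links, page_links = [], []
    for url in urls:
        # Compare only the path, so query strings like ?download=1 are ignored.
        is_pdf = urlparse(url).path.lower().endswith(".pdf")
        (pdf_links if is_pdf else page_links).append(url)
    return pdf_links, page_links
```

Links in the second list are the ones that would need a browser-driven tool such as Selenium, or a manual upload, to reach the underlying document.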