Ramgopal Prajapat:

Learnings and Views

ML Based Startups’ Investment Proposal Screening System

Aug 01, 2020

Project Summary/Context

Investors receive thousands of proposals to read, review, and respond to. There are many tips, templates, and techniques for creating a perfect investment proposal. In this project, we aim to build a proposal summary screening system that helps startups get a likelihood score for their proposal (indicating the chance of an investor responding positively for a next meeting or discussion). The system also saves investors valuable time: they can focus on meeting and talking with startup founders instead of reading proposal emails. It also brings consistency, transparency, and confidence to the proposal screening process.

High Level Steps


A startup prepares an investment proposal and submits it to one or more VC investors. The VC investment team screens the proposal and responds to the startup if they are interested in the next level of discussion.


Startup: Information about a startup submitting a proposal to an investor. There can be multiple rows for the same startup. E.g. Scenario 1: Startup A submits one proposal to investor VC1 in Jan-2019 and another proposal to the same VC in Oct-2019, so there will be two entries. Scenario 2: Startup A submits a proposal to investors VC1 and VC2; then there will also be two entries. So, the startup data is linked to the proposal. Startup data has a proposal ID and a VC ID as columns that are useful for linking to other tables.

Proposal: This has the data captured about a proposal, mainly the body of text and the decision on the proposal.

Investor: Investor information, including key industry and sector focus.
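The three tables above can be sketched as plain Python records linked by proposal ID and VC ID. This is only an illustration of the linkage: the class names, field names, and example rows are hypothetical, and it assumes each submission gets its own proposal ID, which the description above does not state.

```python
from dataclasses import dataclass


@dataclass
class StartupRow:
    startup_id: str
    proposal_id: str   # link to the Proposal table
    vc_id: str         # link to the Investor table
    submitted: str     # hypothetical field, e.g. "2019-01"


@dataclass
class Proposal:
    proposal_id: str
    summary_text: str  # body of text
    screened: int      # decision: 1 = next discussion, 0 = no


@dataclass
class Investor:
    vc_id: str
    focus_sectors: tuple  # key industry and sector focus


def link_rows(startup_rows, proposals, investors):
    """Join every startup row to its proposal and investor by ID."""
    proposals_by_id = {p.proposal_id: p for p in proposals}
    investors_by_id = {v.vc_id: v for v in investors}
    return [
        (row, proposals_by_id[row.proposal_id], investors_by_id[row.vc_id])
        for row in startup_rows
    ]


# Made-up example: startup A submits twice to VC1 and once to VC2.
startup_rows = [
    StartupRow("A", "P1", "VC1", "2019-01"),
    StartupRow("A", "P2", "VC1", "2019-10"),
    StartupRow("A", "P3", "VC2", "2019-10"),
]
proposals = [
    Proposal("P1", "First summary text", 0),
    Proposal("P2", "Revised summary text", 1),
    Proposal("P3", "Summary sent to a second investor", 0),
]
investors = [
    Investor("VC1", ("fintech",)),
    Investor("VC2", ("healthcare", "retail")),
]

linked = link_rows(startup_rows, proposals, investors)
```

Each element of `linked` then carries the startup attributes, the proposal text and decision, and the investor focus for one submission.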

Data Preparation

The decision here is whether to have the next level of discussion for a proposal at that point in time. 

There can be multiple rows for a single company at a given time. Company A may submit an investment proposal summary to 20 investors, while company B may submit its summary to 30 investors.

Also, a company P2 may have submitted a proposal to an investor in Feb-2019, Dec-2019, and Mar-2020, and every time the business summary may be very different. So, the table will have 3 different rows for the same company and investor IDs.

Also, the screening rate is around 5% (only 1 in 20 proposals gets a response for the next round of discussion). So, the sample is highly imbalanced. We therefore created a biased sample: select all rows labeled 1 (screening decision is Yes) and only a subset of rows labeled 0 (screening decision is No). The rows from the non-screened segment were selected randomly.
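A minimal sketch of this biased sampling, keeping every positive row and randomly drawing a subset of the negative rows. The function name, the `screened` field, and the ratio of 3 negatives per positive are assumptions for illustration; the ratio actually used in the project is not stated here.

```python
import random


def build_biased_sample(rows, negatives_per_positive=3, seed=42):
    """Keep all rows with screened == 1 and a random subset of screened == 0."""
    positives = [r for r in rows if r["screened"] == 1]
    negatives = [r for r in rows if r["screened"] == 0]

    # Never ask for more negatives than exist.
    n_negatives = min(len(negatives), negatives_per_positive * len(positives))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(sample)        # avoid all positives sitting at the front
    return sample


# Made-up data with a 5% screening rate: 1 in 20 rows is positive.
rows = [{"proposal_id": i, "screened": 1 if i % 20 == 0 else 0}
        for i in range(1000)]
sample = build_biased_sample(rows)
```

Because the sample no longer reflects the true 5% rate, scores from a model trained on it are best read as a ranking, or recalibrated before being presented as probabilities.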

Bag-of-words and word embeddings are used to create features from the proposal summary text corpus.
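As a rough illustration of the bag-of-words part only (word embeddings are not shown), the sketch below turns each summary into a vector of token counts over a shared vocabulary. The tokenizer, the function names, and the two example summaries are hypothetical, not the project's actual preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents, max_features=1000):
    """Map the most frequent tokens in the corpus to column indices."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    # Sort by frequency, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: idx for idx, (tok, _) in enumerate(ranked[:max_features])}


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] += 1
    return vector


# Made-up proposal summaries.
summaries = [
    "AI platform for retail analytics",
    "Retail payments platform for small merchants",
]
vocabulary = build_vocabulary(summaries)
vectors = [bag_of_words(s, vocabulary) for s in summaries]
```

The resulting count vectors can sit next to the structured startup features in one feature matrix.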


Features related to the startup (number of founders, months of operation, team size, and many more) and features created from the proposal summary text are used for building the model.

The decision is whether to screen for the next round of discussion.

XGBoost and CNN are used for developing the prediction model.


The startup ecosystem is, by design, ever-evolving, but the fundamental dimensions used for screening and assessing a startup idea remain similar. To keep the balance, we designed the system to split the volume of emails between the system and the business analysts (BAs): initially more to the BAs and less to the system. Over time, this split is expected to shift in favor of system-based screening.
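One way to picture this split is a router that sends each incoming proposal to the system with some probability and to a BA otherwise, with that probability growing over time. The sketch below is an assumption about how such a split could work; the function names, the starting share of 10%, the monthly step, and the cap are all made up for illustration.

```python
import random


def system_share_for_month(month, start=0.1, step=0.05, cap=0.9):
    """Share of proposals routed to the system, growing month by month."""
    return min(cap, start + step * month)


def route_proposal(system_share, rng):
    """Return 'system' with probability system_share, else 'analyst'."""
    return "system" if rng.random() < system_share else "analyst"


# Route a batch of 1,000 made-up proposals in month 6 of the rollout.
rng = random.Random(7)
share = system_share_for_month(6)
routes = [route_proposal(share, rng) for _ in range(1000)]
```

Keeping a share with the BAs even late in the rollout (the cap) leaves a stream of human decisions that can be compared against the system's scores as the ecosystem changes.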


Additional thoughts:

An investment decision is made at a point in time, and the context of that time plays a very important role. So, consider all information available at that point in time. For example, the tone and enthusiasm of news articles can be an important input.

The startup must be tracked over a period of time to assess whether a VC investor's screening rejection was the right decision.

Also, we may want to estimate CAGR, especially for advanced-stage startups, based on all the information available at that point in time. This can be an additional dimension for discussions and decisions.
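For reference, CAGR (compound annual growth rate) between two observed values is (end / start) ^ (1 / years) - 1. The sketch below only computes this standard definition from historical figures. Estimating a future CAGR for a startup would need a separate model, and the function name and example numbers are illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observed values."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Made-up example: revenue grows from 100 to 400 over 2 years,
# i.e. it doubles each year, which is a CAGR of 100%.
growth = cagr(100.0, 400.0, 2)
```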

A startup has multiple dimensions and a comprehensive input data model can help in designing a better system. Some of the data dimensions to be considered are:


Business Outcome

An automated system based on a large number of dimensions and factors can be a powerful investment proposal screening system. Such a system can save investors a huge amount of time and money.