M1103 Problem Set 3

In this problem set we will write a program that automatically generates tweets by Donald Trump.

We will proceed as follows: from existing tweets, we will produce new tweets that are “false” but credible because they will be built in a coherent way on the basis of excerpts from existing tweets. For instance, on the basis of the two following tweets

The dog eats the biscuits.

The cat eats the mice.

we will generate a new tweet in the following way:

  1. Randomly select a possible beginning: the first two words of an existing tweet chosen randomly. Given the example we randomly select for instance The dog. This initializes a pair of words, here "The" and "dog", that we will call current pair.

  2. Given a current pair, we randomly select a possible word to continue the sentence : a word that already follows the words of the pair in existing tweets. With the example, after the pair The dog, the only possible word (because already seen in the tweets) is eats. We therefore choose this word and the current pair becomes dog eats. Similarly, after this pair, the only possible following word is the, and the current pair therefore becomes eats the. Now, given this pair, two following words are possible: biscuits or mice. We are therefore going to randomly choose one of these two words, for instance, mice. The current pair becomes the mice. There are now no more possible following word, and we stop the generation of the sentence.

We have therefore produced the “false” tweet:

The dog eats the mice.

This example is very simple; at the end of this lab, you will generate “false” tweets on the basis of a set of 1000 tweets that were actually posted on Donald Trump’s account.

Download the file code.zip and add its content (the files pair.cpp, pair.h, distribution.cpp, distribution.h, chain.cpp, chain.h, test.cpp, asserts.h) to a new Code::Blocks project Move the file tweets.txt to the main folder of your project.

You can test your functions by executing the program whose main function is defined in the file test.cpp.

Problem 1 — Pairs of Strings

As seen above, we need to handle pairs of consecutive words. To do so we will define a data structure to store such a pair. This data structure will be the Pair class used in the files pair.h and pair.cpp which we will write in this first exercise.

a — Encapsulating Two Strings

The first task is to encapsulate the two strings in the objects of the class Pair. We will thus write a constructor that takes two strings, and two getter functions to be able to access them.

Write the implementation of the constructor in the class Pair and the two member functions get_first and get_second. The function get_first returns the first string passed to the constructor, the function get_second returns the second.

b — Equality of Pairs

We will store pairs of strings in vectors. To be able to find the again, we will implement an equality relation for pairs.

Write the implementation of the operator == for instances of the class Pair. It returns true if and only if the first and the second strings are equal.

Problem 2 — Random Words

To guess the next word in a tweet, we do a random choice of the word in a list of possible words. This will be done in the class Distribution in the files distribution.h and distribution.cpp.

This class will store a vector of words, and you will define the member functions allowing to add words to this vector.

To respresent efficiently a list of words where words can occur multiple times, we need two vectors:

a — Adding Possible Words

Every instance of the class Distribution contains a list of possible words, in which we will randomly choose an element. To start, we will write a function to add words to this list and a second function to count the number of times that we added a given word.

Write the implementation of the member functions add_string and get_count. The function get_count returns the number of times that add_string was called with the string given as an argument.

b — Total Count of Words

We want to choose a given word in the list of possible words with a probability proportional to the number of times it was added. For that, we need to know the total number of times that words were added.

Write the implementation of the member function get_total_count. It returns the number of times that add_string was called.

c — Random Choice

The member function random of the class Distribution returns a random word in the list of words that were added. A word that was added \(n\) times is chosen with probability \(n/T\) where \(T\) is the total count of words in the list.

Write the implementation of the member function random so that it returns a random element that was previously added, with a probability proportional to the number of times that the word was added.

If the total count is zero, i.e., no string was added, return the string “END”.

Problem 3 — Constructing a Tweet

We will now implement the class Chain (in the files chain.h and chain.cpp) which keeps for each possible pair of words, an instance of the class Distribution. The distribution for a given pair determines the choice of the next word.

We will keep in a data member an instance of the Pair class corresponding to the current pair (the two most recently chosen word).

After an initial choice of the first two words, we will iteratively choose a word to add to the end of the current tweet. We will update the data member taking into account the newly chosen word.

a — Finding the First Two Words

To start a tweet, we need to determine the first two words. For that, we will store in the Chain a vector of possible Pairs to start a tweet with (namely those used by Donald Trump at the beginning of a tweet).

We will add possible pairs (without duplicates) for the start of the tweet with the member function add_start_pair. The member function init chooses randomly a pair that was added with add_start_pair. We want to be able to inspect the current pair (i.e., the pair of the two most recent words of the tweet) by calling the member function get_current.

With the preceeding example, the parameter of the method add_start_pair will therefore be successively replaced with the pair The dog and the pair The cat. The method init() will randomly choose one of these two pairs in the vector, with the example The dog, and store it in the current pair. Finally, the method get_current will return this current pair that has been stored.

Write the implementation of the member functions add_start_pair, init, and get_current described above. Every added pair should have the same probability to be chosen in the function init. Before having added pairs with add_start_pair or before having called init, the function get_current can return an arbitrary Pair.

b — Finding the Next Word

Now that we know of to choose the first two words of the tweet, we will iteratively choose the next word, depending on the preceding two words. The member function add_string_to_pair takes a string and a Pair as arguments. It should add the given string to the distribution corresponding to the given pair. For that it will be necessary the store not only a vector of pairs, but also a vector of distributions. If the given pair is not already stored in the vector, we will need to add it to the vector and create a new distribution for the pair.

The function next returns the next word according to the function random of the distribution of the current pair, i.e., the pair returned by the function get_current.

It also updates the current pair with the new word. If there is no possible next word, it returns the string “END”.

With the above example, the parameters of the method add_string_to_pair will therefore be successively replaced by:

At the end, the vector of stored pairs will therefore contain

The dog dog eats eats the The cat cat eats

and the vector of distributions will contain

"eats" "the" "biscuits","mice" "eats" "the"
1 1 1 , 1 1 1

The method next will look for the distribution corresponding to the current pair, randomly select a word in this distribution, and update the current pair. With the example, when the current pair is eats the, the corresponding distribution is

"biscuits","mice"
1 , 1

and the method will therefore randomly select between these two following words.

Write the implementations of the member functions add_string_to_pair and next described above.

c — Complete Tweets

You code, together with the function read and the loop at the end of the main function which we have written for you, now allows to generate new tweets on the basis of a data base of tweets stored in the file tweets.txt. When running the program, you will see randomly generated tweets.

back to the course web site